# Getting Started with pandas

pandas contains data structures and data manipulation tools designed to make data cleaning
and analysis fast and convenient in Python. It is often used in tandem with
numerical computing tools like NumPy and SciPy, analytical libraries like statsmodels
and scikit-learn, and data visualization libraries like matplotlib. pandas adopts
significant parts of NumPy’s idiomatic style of array-based computing, especially
array-based functions and a preference for data processing without for loops.
While pandas adopts many coding idioms from NumPy, the biggest difference
is that pandas is designed for working with tabular or heterogeneous data. NumPy, by
contrast, is best suited for working with homogeneously typed numerical array data.
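
The contrast between homogeneous arrays and heterogeneous tables can be seen in a small sketch. The column names and values here are made up purely for illustration:

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single dtype; mixing ints and floats upcasts every element.
arr = np.array([[1, 2.5], [3, 4.5]])
print(arr.dtype)

# A DataFrame keeps a separate dtype per column, so mixed tabular data stays typed.
# The column names and values are hypothetical.
df = pd.DataFrame({"name": ["a", "b"], "count": [1, 3], "ratio": [2.5, 4.5]})
print(df.dtypes)
```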

In [1]:
import pandas as pd

## Introduction to pandas Data Structures


### 1. Series

A Series is a one-dimensional array-like object containing a sequence of values of the
same type (similar to NumPy types) and an associated array of data labels,
called its index. The simplest Series is formed from only an array of data:

In [2]:
obj = pd.Series([4, 7, -5, 3])
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [3]:
obj.array

<PandasArray>
[4, 7, -5, 3]
Length: 4, dtype: int64

In [4]:
obj.index

RangeIndex(start=0, stop=4, step=1)

In [5]:
obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"]) #Indexing using given values
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [6]:
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

In [7]:
#Compared with NumPy arrays, you can use labels in the index when selecting single values or a set of values
obj2['a']

-5

In [8]:
obj2[["c", "a", "d"]]
#Here ["c", "a", "d"] is interpreted as a list of indices, even though it contains strings instead of integers

c    3
a   -5
d    4
dtype: int64

Using NumPy functions or NumPy-like operations, such as filtering with a Boolean
array, scalar multiplication, or applying math functions, will preserve the index-value
link:

In [9]:
obj2[obj2 > 0]

d    4
b    7
c    3
dtype: int64

In [10]:
obj2 * 2

d     8
b    14
a   -10
c     6
dtype: int64

In [11]:
import numpy as np

In [12]:
np.exp(obj2)

d      54.598150
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

In [13]:
np.cbrt(obj2)

d    1.587401
b    1.912931
a   -1.709976
c    1.442250
dtype: float64

Another way to think about a Series is as a fixed-length, ordered dictionary, as it is a
mapping of index values to data values. It can be used in many contexts where you
might use a dictionary:

In [14]:
"b" in obj2

True
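
A few more dictionary-style operations, as a minimal sketch. It rebuilds `obj2` so that it runs on its own; the label `"z"` is a hypothetical missing key:

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

# Membership tests check the index labels, not the values.
has_b = "b" in obj2   # "b" is a label
has_7 = 7 in obj2     # 7 is a value, not a label

# get() returns a default instead of raising KeyError for a missing label.
missing = obj2.get("z", 0)

# items() yields (label, value) pairs, like dict.items().
pairs = [(label, int(value)) for label, value in obj2.items()]
print(has_b, has_7, missing, pairs)
```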

Should you have data contained in a Python dictionary, you can create a Series from
it by passing the dictionary:

In [15]:
sdata = {"anantnag":3000, "pulwama":2500, "srinagar":800, "kulgam":2800}

In [16]:
obj3 = pd.Series(sdata)
obj3

anantnag    3000
pulwama     2500
srinagar     800
kulgam      2800
dtype: int64

In [17]:
sdata = obj3.to_dict()
sdata

{'anantnag': 3000, 'pulwama': 2500, 'srinagar': 800, 'kulgam': 2800}

In [18]:
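states = ["srinagar", "anantnag", "pulwama", "shopian"] #illustrative labels; states was not defined in the original cell
obj4 = pd.Series(sdata, index=states)
obj4

srinagar     800.0
anantnag    3000.0
pulwama     2500.0
shopian        NaN
dtype: float64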

### 2. DataFrame

A DataFrame represents a rectangular table of data and contains an ordered, named
collection of columns, each of which can be a different value type (numeric, string,
Boolean, etc.). The DataFrame has both a row and column index; it can be thought of
as a dictionary of Series all sharing the same index.
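
The "dictionary of Series" view can be made concrete with a short sketch. The labels `r1` and `r2` and the column names are hypothetical:

```python
import pandas as pd

# Two Series that share the same row labels (made-up values).
heights = pd.Series([1.5, 1.8], index=["r1", "r2"])
names = pd.Series(["a", "b"], index=["r1", "r2"])

# Each dictionary key becomes a column; the shared labels become the row index.
df = pd.DataFrame({"height": heights, "name": names})

# Pulling a column back out gives a Series that carries the DataFrame's row index.
col = df["height"]
print(df)
print(col.index.equals(df.index))
```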

There are many ways to construct a DataFrame, though one of the most common is
from a dictionary of equal-length lists or NumPy arrays:

In [20]:
import pandas as pd
data = {"district": ["anantnag", "anantnag", "anantnag", "pulwama", "pulwama", "pulwama"],
        "years": [1991, 2001, 2011, 1991, 2001, 2011],
        "pop": [2.0, 1.5, 1.8, 0.0, 0.8, 1.0]}

In [21]:
frame = pd.DataFrame(data)
frame

   district  years  pop
0  anantnag   1991  2.0
1  anantnag   2001  1.5
2  anantnag   2011  1.8
3   pulwama   1991  0.0
4   pulwama   2001  0.8
5   pulwama   2011  1.0


In [22]:
#For large DataFrames, the head method selects only the first five rows
frame.head()

   district  years  pop
0  anantnag   1991  2.0
1  anantnag   2001  1.5
2  anantnag   2011  1.8
3   pulwama   1991  0.0
4   pulwama   2001  0.8


In [23]:
#Similarly, tail returns the last five rows
frame.tail()

   district  years  pop
1  anantnag   2001  1.5
2  anantnag   2011  1.8
3   pulwama   1991  0.0
4   pulwama   2001  0.8
5   pulwama   2011  1.0


In [24]:
#If you specify a sequence of columns, the DataFrame’s columns will be arranged in that order

In [25]:
frame2 = pd.DataFrame(data, columns=["years", "district", "pop"])
frame2

   years  district  pop
0   1991  anantnag  2.0
1   2001  anantnag  1.5
2   2011  anantnag  1.8
3   1991   pulwama  0.0
4   2001   pulwama  0.8
5   2011   pulwama  1.0


In [26]:
frame2.columns

Index(['years', 'district', 'pop'], dtype='object')

A column in a DataFrame can be retrieved as a Series either by dictionary-like
notation or by using the dot attribute notation:

In [27]:
frame2["pop"]

0    2.0
1    1.5
2    1.8
3    0.0
4    0.8
5    1.0
Name: pop, dtype: float64

In [28]:
frame2.years

0    1991
1    2001
2    2011
3    1991
4    2001
5    2011
Name: years, dtype: int64

Rows can also be retrieved by position or name with the special iloc and loc
attributes:

In [29]:
frame2.loc[1]

years           2001
district    anantnag
pop              1.5
Name: 1, dtype: object

In [30]:
frame2.iloc[1]

years           2001
district    anantnag
pop              1.5
Name: 1, dtype: object
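
With the default integer index, `loc[1]` and `iloc[1]` return the same row, which hides the difference between them. A small sketch with a deliberately shuffled, hypothetical index makes the distinction visible:

```python
import pandas as pd

df = pd.DataFrame(
    {"years": [1991, 2001, 2011]},
    index=[2, 0, 1],  # labels deliberately differ from positions
)

# loc looks up the row whose label is 1, which is the third row here.
by_label = df.loc[1]

# iloc takes the row at position 1 (the second row), whatever its label.
by_position = df.iloc[1]

print(by_label["years"], by_position["years"])
```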

Columns can be modified by assignment. For example, a new debt column can be
created by assigning a scalar value or an array of values:

In [31]:
frame2

   years  district  pop
0   1991  anantnag  2.0
1   2001  anantnag  1.5
2   2011  anantnag  1.8
3   1991   pulwama  0.0
4   2001   pulwama  0.8
5   2011   pulwama  1.0


In [32]:
frame2["debt"] = 0

In [33]:
frame2

   years  district  pop  debt
0   1991  anantnag  2.0     0
1   2001  anantnag  1.5     0
2   2011  anantnag  1.8     0
3   1991   pulwama  0.0     0
4   2001   pulwama  0.8     0
5   2011   pulwama  1.0     0


In [35]:
frame2["debt"] = np.arange(6)

In [36]:
frame2

   years  district  pop  debt
0   1991  anantnag  2.0     0
1   2001  anantnag  1.5     1
2   2011  anantnag  1.8     2
3   1991   pulwama  0.0     3
4   2001   pulwama  0.8     4
5   2011   pulwama  1.0     5
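
Column assignment also accepts a Series, in which case values are matched by index label rather than by position. The following is a minimal sketch on a smaller, made-up frame; the `flag` column is hypothetical and exists only to show deletion:

```python
import pandas as pd

frame = pd.DataFrame({"years": [1991, 2001, 2011], "pop": [2.0, 1.5, 1.8]})

# Assigning a Series aligns on the index; rows without a matching label become NaN.
frame["debt"] = pd.Series([-1.2, -1.5], index=[0, 2])

# Assigning a list or array requires its length to match the DataFrame's length.
frame["flag"] = [True, False, True]

# del removes a column, as with a dictionary.
del frame["flag"]
print(frame)
```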


### Index Objects 

pandas’s Index objects are responsible for holding the axis labels (including a
DataFrame’s column names) and other metadata (like the axis name or names). Any array
or other sequence of labels you use when constructing a Series or DataFrame is
internally converted to an Index:

In [37]:
obj = pd.Series(np.arange(3), index=["a","b","c"])
obj

a    0
b    1
c    2
dtype: int32

In [38]:
obj.index

Index(['a', 'b', 'c'], dtype='object')

In [39]:
obj.index[1:]

Index(['b', 'c'], dtype='object')

In [41]:
obj.index[1] = "h" #immutable so produces error

TypeError: Index does not support mutable operations

In [42]:
#Immutability makes it safer to share Index objects among data structures
labels = obj.index

In [43]:
labels

Index(['a', 'b', 'c'], dtype='object')

In [44]:
labels[0] = 'h'

TypeError: Index does not support mutable operations

In [45]:
frame2.columns

Index(['years', 'district', 'pop', 'debt'], dtype='object')

In [47]:
'pop' in frame2.columns

True

In [50]:
1 in frame2.index

True
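
Beyond membership tests, an Index supports set-style operations that return new Index objects. A brief sketch with made-up labels, which also checks the immutability shown earlier without stopping on the error:

```python
import pandas as pd

idx = pd.Index(["a", "b", "c"])

# Item assignment raises TypeError because an Index is immutable.
try:
    idx[1] = "h"
    mutated = True
except TypeError:
    mutated = False

# Set-style operations return new Index objects rather than modifying in place.
other = pd.Index(["b", "c", "d"])
common = idx.intersection(other)
combined = idx.union(other)

# Unlike a Python set, an Index may contain duplicate labels.
dup = pd.Index(["x", "x", "y"])
print(mutated, list(common), list(combined), dup.is_unique)
```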

## Reindexing 

In [51]:
obj = pd.Series([1.1,2.5,1.8,3.0], index=["a","c","d","e"])
obj

a    1.1
c    2.5
d    1.8
e    3.0
dtype: float64

In [60]:
obj2 = obj.reindex(["a","b","c","d","e","f"])
obj

a    1.1
c    2.5
d    1.8
e    3.0
dtype: float64

In [61]:
"""Calling reindex on this Series rearranges the data according to the new index,
introducing missing values if any index values were not already present"""
obj2

a    1.1
b    NaN
c    2.5
d    1.8
e    3.0
f    NaN
dtype: float64

In [75]:
"""For ordered data like time series, you may want to do some interpolation or filling of
values when reindexing. The method option allows us to do this, using a method such
as ffill, which forward-fills the values"""
obj3 = pd.Series(["red","yellow","green","blue"], index=[0,2,4,6])
obj3

0       red
2    yellow
4     green
6      blue
dtype: object

In [77]:
obj3.reindex(np.arange(10), method="ffill") #reindex returns a new Series; obj3 itself is unchanged

0       red
1       red
2    yellow
3    yellow
4     green
5     green
6      blue
7      blue
8      blue
9      blue
dtype: object
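
Because reindex returns a new object rather than changing the Series in place, the result has to be captured to be used. A small sketch with a shorter, made-up color Series compares reindexing with and without forward filling:

```python
import numpy as np
import pandas as pd

colors = pd.Series(["red", "yellow", "green"], index=[0, 2, 4])

# reindex returns a new Series; the original is left untouched.
filled = colors.reindex(np.arange(6), method="ffill")

# Without a fill method, the new labels get missing values instead.
unfilled = colors.reindex(np.arange(6))

print(filled.tolist())
print(len(colors), int(unfilled.isna().sum()))
```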

### Dropping Entries from an Axis 

Dropping one or more entries from an axis is simple if you already have an index
array or list without those entries, since you can use the reindex method or
loc-based indexing. As that can require a bit of munging and set logic, the drop method
will return a new object with the indicated value or values deleted from an axis:

In [2]:
import numpy as np
import pandas as pd
obj = pd.Series(np.arange(5.), index=["a", "b", "c", "d", "e"])
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [3]:
new_obj = obj.drop("c")
new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [6]:
obj.drop(["d", "c"])

a    0.0
b    1.0
e    4.0
dtype: float64

With DataFrame, index values can be deleted from either axis. To illustrate this, we
first create an example DataFrame:

In [8]:
data = pd.DataFrame(np.arange(16).reshape(4,4), index=["anantnag","pulwama","kulgam","shopian"],
                    columns=["one","two", "three", "four"])

In [9]:
data

          one  two  three  four
anantnag    0    1      2     3
pulwama     4    5      6     7
kulgam      8    9     10    11
shopian    12   13     14    15


In [17]:
data.iloc[0]

one      0
two      1
three    2
four     3
Name: anantnag, dtype: int32

In [18]:
data["one"]

anantnag     0
pulwama      4
kulgam       8
shopian     12
Name: one, dtype: int32

In [19]:
data.drop(index=["kulgam","shopian"])

          one  two  three  four
anantnag    0    1      2     3
pulwama     4    5      6     7


In [22]:
#To drop labels from the columns, instead use the columns keyword
data.drop(columns=["one"])

          two  three  four
anantnag    1      2     3
pulwama     5      6     7
kulgam      9     10    11
shopian    13     14    15


In [24]:
"""You can also drop values from the columns by passing axis=1 (which is like NumPy)
or axis="columns":"""
data.drop("two", axis=1)

          one  three  four
anantnag    0      2     3
pulwama     4      6     7
kulgam      8     10    11
shopian    12     14    15


In [25]:
data.drop(["two", "four"], axis="columns")

          one  three
anantnag    0      2
pulwama     4      6
kulgam      8     10
shopian    12     14
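
Each drop call above returns a new object and leaves `data` unchanged. The sketch below checks that on a smaller, made-up frame and also shows the `errors="ignore"` option for labels that may not exist; the label `"missing"` is hypothetical:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(6).reshape(2, 3),
    index=["r1", "r2"],
    columns=["one", "two", "three"],
)

# drop returns a new DataFrame; the original keeps all of its columns.
trimmed = data.drop(columns=["two"])

# Dropping a label that does not exist raises KeyError unless errors="ignore" is passed.
same = data.drop(index=["missing"], errors="ignore")

print(list(trimmed.columns), list(data.columns))
```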


## Indexing, Selection, and Filtering 

Series indexing (obj[...]) works analogously to NumPy array indexing, except you
can use the Series’s index values instead of only integers. Here are some examples of
this:

In [26]:
obj = pd.Series(np.arange(4.), index=["a", "b", "c", "d"])

In [27]:
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [28]:
obj['a']

0.0

In [29]:
obj[2:3]

c    2.0
dtype: float64

In [32]:
obj[["b", "a", "d"]]

b    1.0
a    0.0
d    3.0
dtype: float64
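
Slicing behaves differently depending on whether it is done by position or by label. This sketch uses the iloc and loc attributes to make the distinction explicit, rebuilding `obj` so it runs on its own:

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.0), index=["a", "b", "c", "d"])

# Integer slices follow Python rules: the stop position is excluded.
by_position = obj.iloc[1:3]   # rows "b" and "c"

# Label slices with loc include the end label.
by_label = obj.loc["b":"c"]   # rows "b" and "c"
wider = obj.loc["b":"d"]      # rows "b", "c", and "d"

print(list(by_position.index), list(by_label.index), list(wider.index))
```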

While you can select data by label this way, the preferred way to select index values is
with the special loc operator:

In [33]:
obj.loc[["b", "a", "d"]]

b    1.0
a    0.0
d    3.0
dtype: float64

### Selection on DataFrame with loc and iloc

Like Series, DataFrame has special attributes loc and iloc for label-based and
integer-based indexing, respectively. Since DataFrame is two-dimensional, you can
select a subset of the rows and columns with NumPy-like notation using either axis
labels (loc) or integers (iloc).

You can combine both row and column selection in loc by separating the selections
with a comma.
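
As a minimal sketch of combined row and column selection, the example below rebuilds the 4x4 `data` frame used in the drop examples and selects from it with both loc and iloc:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(16).reshape(4, 4),
    index=["anantnag", "pulwama", "kulgam", "shopian"],
    columns=["one", "two", "three", "four"],
)

# loc: row label(s) first, then column label(s), separated by a comma.
cell = data.loc["pulwama", "two"]
row_subset = data.loc["kulgam", ["one", "three"]]

# iloc: the same idea with integer positions.
same_cell = data.iloc[1, 1]
block = data.iloc[:2, [0, 3]]  # first two rows, first and last columns

print(cell, same_cell)
print(row_subset.tolist())
print(block.values.tolist())
```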

## Arithmetic and Data Alignment

In [3]:
import numpy as np
import pandas as pd
s1 = pd.Series([1,2,3,4,5], index=["a","b","c","d","e"])
s1

a    1
b    2
c    3
d    4
e    5
dtype: int64

In [4]:
s2 = pd.Series([1.1,2.1,3.1,4.1,5.1])
s2

0    1.1
1    2.1
2    3.1
3    4.1
4    5.1
dtype: float64

In [5]:
s1 + s2

a   NaN
b   NaN
c   NaN
d   NaN
e   NaN
0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
dtype: float64
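
Every entry above is NaN because `s1` and `s2` share no index labels. A sketch with partially overlapping, made-up labels shows how alignment works when some labels match, and how the `add` method with `fill_value` treats a missing operand as a given value:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# Addition aligns on labels; labels present in only one operand give NaN.
total = s1 + s2

# add() with fill_value=0 substitutes 0 for the operand that is missing a label.
total_filled = s1.add(s2, fill_value=0)

print(total)
print(total_filled)
```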

In [16]:
#Two-dimensional data needs pd.DataFrame; pd.Series does not accept a columns argument
df1 = pd.DataFrame(np.arange(9.).reshape(3,3), columns=list("bcd"), index=["jammu","kashmir","ladakh"])
#df2 has four rows, so it needs four index labels; "himachal" is an illustrative addition
df2 = pd.DataFrame(np.arange(12.).reshape(4,3), columns=list("abc"), index=["jammu","punjab","kashmir","himachal"])
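
Arithmetic between two DataFrames aligns on both the rows and the columns. The following is a small sketch with hypothetical labels (`r1`, `r2`, `r3`) rather than the frames above, kept tiny so the result is easy to check by eye:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(4.0).reshape(2, 2), columns=list("bc"), index=["r1", "r2"])
df2 = pd.DataFrame(np.arange(4.0).reshape(2, 2), columns=list("ab"), index=["r1", "r3"])

# The result covers the union of row labels and the union of column labels.
# Only cells present in both frames (here row "r1", column "b") get a value.
result = df1 + df2
print(result)
```

Only one cell is populated because `r1` is the only shared row and `b` is the only shared column; everything else is NaN.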