
# Pandas Basics

Pandas is a Python library for data manipulation and analysis. The name is derived from the term "panel data".

The two primary data structures of pandas are Series (1-dimensional) and DataFrame (2-dimensional).

### Series:

***pandas.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)***



It's a one-dimensional ndarray with axis labels (including time series).

Let's create a Series and do some basic operations.

In [32]:
import pandas as pd

list_1 = [1,2,3,4,5]
tuple_1 = (1,2,3,4,5)

series_1 = pd.Series(data=list_1)
series_2 = pd.Series(data=tuple_1)

print("List Series:")
print(series_1)
print("\nTuple Series:\n")
print(series_2)

List Series:
0    1
1    2
2    3
3    4
4    5
dtype: int64

Tuple Series:

0    1
1    2
2    3
3    4
4    5
dtype: int64
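
The constructor signature above also accepts `index` and `name`. As a minimal sketch, the following builds a Series with custom axis labels; the labels, the values, and the name `prices` are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical labels and name, chosen only for illustration.
prices = pd.Series(data=[1, 2, 3], index=["x", "y", "z"], name="prices")

print(prices.index.tolist())  # the axis labels
print(prices.name)            # the name attached to the Series
print(prices["y"])            # look up a value by its label
```

When no `index` is passed, as in the cell above, pandas falls back to the default integer labels 0, 1, 2, and so on.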


### DataFrame:

***pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)***

It's a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).

Let's create a DataFrame and do some basic operations.

In [33]:
# each dict is one row; the keys become the column labels
df_dict_1 = [{'a': 1, 'b': 2, 'c': 3, 'd': 4},
             {'a': 10, 'b': 20, 'c': 30, 'd': 40},
             {'a': 100, 'b': 200, 'c': 300, 'd': 400},
             {'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000}]

df_1 = pd.DataFrame(df_dict_1)

df_1

      a     b     c     d
0     1     2     3     4
1    10    20    30    40
2   100   200   300   400
3  1000  2000  3000  4000
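
A DataFrame can also be built column by column. The sketch below assumes a dict of lists with explicit row labels passed through `index`; the column names, values, and row labels are hypothetical. It also illustrates the "heterogeneous" part of the definition: each column keeps its own dtype.

```python
import pandas as pd

# Hypothetical data: a dict of columns instead of a list of row dicts.
df_cols = pd.DataFrame(
    {"name": ["p", "q", "r"], "score": [1.5, 2.5, 3.5]},
    index=["r1", "r2", "r3"],  # explicit row labels
)

print(df_cols.shape)   # (number of rows, number of columns)
print(df_cols.dtypes)  # one dtype per column
```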


#### Selecting rows and columns of a dataframe:

loc and iloc are widely used to fetch rows and columns from a dataframe.

iloc is purely integer-location-based indexing for selection by position. When slicing with iloc, the start position is included and the stop position is excluded, so positions start through (stop-1) are returned. Below are some of the allowed inputs for iloc:

1. An integer, (series.iloc[5])

2. A list or array of integers, (series.iloc[[0,1]])

3. A slice object with ints, (series.iloc[1:5])

4. A boolean array of the same length as the axis, (series.iloc[[True, False, True]])
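
The four input kinds listed above can be tried on a small Series. This is a sketch with made-up values; the variable names are illustrative only.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50, 60])

single = s.iloc[5]        # 1. an integer position
picked = s.iloc[[0, 1]]   # 2. a list of integer positions
sliced = s.iloc[1:5]      # 3. a slice: positions 1 to 4, stop excluded

# 4. a boolean list with one entry per element of s
mask = [True, False, True, False, False, False]
masked = s.iloc[mask]

print(single)
print(picked.tolist())
print(sliced.tolist())
print(masked.tolist())
```

Note that the boolean values are passed as a list inside the brackets, and the list needs one entry per element.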

In [34]:
# selection using an integer
print("using integers")
print(df_1.iloc[0]) # this fetches all columns of the first row (position 0)

using integers
a    1
b    2
c    3
d    4
Name: 0, dtype: int64


In [35]:
# to fetch a specific value, use a comma: row positions go on the left, column positions on the right
print("\nspecific columns using integers\n")
print(df_1.iloc[0,1])


specific columns using integers

2


In [36]:
# selection using an array of integers
print("\nspecific columns using arrays of integers\n")
print(df_1.iloc[[0,1],[2,3]])


specific columns using arrays of integers

    c   d
0   3   4
1  30  40


In [37]:
# slice object with ints
print("\nslice\n")
print(df_1.iloc[0:2,1:3])


slice

    b   c
0   2   3
1  20  30


In [38]:
# some more examples
print(df_1.iloc[0:2,[0,3]])
print("\n\n")
print(df_1.iloc[[0,2],0:3])
print("\n\n")
print(df_1['a'])
print("\n\n")
print(df_1[['a','b']])

    a   d
0   1   4
1  10  40



     a    b    c
0    1    2    3
2  100  200  300



0       1
1      10
2     100
3    1000
Name: a, dtype: int64



      a     b
0     1     2
1    10    20
2   100   200
3  1000  2000


loc is a purely label-based indexer for selection by label. When slicing with loc, both the start and the stop labels are included. Below are some of the allowed inputs for loc:

1. A single label, e.g. (series.loc[5]) or (series.loc['label']), (note that 5 is interpreted as a label of the index, and never as an integer position along the index).

2. A list or array of labels, (series.loc[['a', 'b', 'c']]).

3. A slice object with labels, (series.loc['a':'f']).

4. A boolean array of the same length as the axis being sliced, (series.loc[[True, False, True]]).
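
The sketch below tries each of these inputs on a Series with string labels, then shows the label-versus-position distinction on a Series whose integer labels do not start at 0. All values and labels are made up for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])

one = s.loc["b"]                            # 1. a single label
several = s.loc[["a", "c"]]                 # 2. a list of labels
inclusive = s.loc["a":"c"]                  # 3. a label slice, both ends included
masked = s.loc[[True, False, False, True]]  # 4. a boolean list, same length as s

# With integer labels, loc looks up the label and iloc the position.
t = pd.Series(["x", "y", "z"], index=[5, 6, 7])
by_label = t.loc[5]      # label 5 is the first element
by_position = t.iloc[2]  # position 2 is the last element

print(one, several.tolist(), inclusive.tolist(), masked.tolist())
print(by_label, by_position)
```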

In [39]:
# selection using a single label
print(df_1.loc[:,'a']) # this fetches all rows of column 'a'

0       1
1      10
2     100
3    1000
Name: a, dtype: int64


In [40]:
# selection using array of labels
print(df_1.loc[:,['a','b']]) # this fetches all rows of columns 'a' and 'b'

      a     b
0     1     2
1    10    20
2   100   200
3  1000  2000


In [41]:
# selecting a slice
print(df_1.loc[2:3,'a':'c']) # this fetches rows labeled 2 through 3 and columns 'a' through 'c' (both ends included)

      a     b     c
2   100   200   300
3  1000  2000  3000


### Adding a new column (Single)

In [42]:
df_1['diff'] = df_1['a'] - df_1['b']
df_1

      a     b     c     d  diff
0     1     2     3     4    -1
1    10    20    30    40   -10
2   100   200   300   400  -100
3  1000  2000  3000  4000 -1000


In [43]:
df_1['new_diff'] = df_1.a - df_1.c
df_1

      a     b     c     d  diff  new_diff
0     1     2     3     4    -1        -2
1    10    20    30    40   -10       -20
2   100   200   300   400  -100      -200
3  1000  2000  3000  4000 -1000     -2000


### Adding new columns (Multiple)

In [44]:
df_1 = df_1.assign(add_mul1=df_1['a'] + df_1['b'], add_mul2=df_1['a'] + df_1['c'])
df_1

      a     b     c     d  diff  new_diff  add_mul1  add_mul2
0     1     2     3     4    -1        -2         3         4
1    10    20    30    40   -10       -20        30        40
2   100   200   300   400  -100      -200       300       400
3  1000  2000  3000  4000 -1000     -2000      3000      4000
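
`assign` returns a new DataFrame rather than modifying the one it is called on, which is why the cell above reassigns the result to `df_1`. As a sketch on a small hypothetical frame, it also accepts callables, and a later keyword can refer to a column created by an earlier one in the same call.

```python
import pandas as pd

# Hypothetical frame, independent of df_1 above.
df = pd.DataFrame({"a": [1, 10], "b": [2, 20]})

out = df.assign(
    total=lambda d: d["a"] + d["b"],        # computed from existing columns
    double_total=lambda d: d["total"] * 2,  # uses 'total' defined just above
)

print(list(out.columns))  # new columns are appended on the right
print(list(df.columns))   # the original frame is unchanged
```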


### Renaming columns

In [45]:
df_1 = df_1.rename(columns={'diff':'renamed_1','new_diff':'renamed_2','add_mul1':'renamed_3','add_mul2':'renamed_4'})
df_1

      a     b     c     d  renamed_1  renamed_2  renamed_3  renamed_4
0     1     2     3     4         -1         -2          3          4
1    10    20    30    40        -10        -20         30         40
2   100   200   300   400       -100       -200        300        400
3  1000  2000  3000  4000      -1000      -2000       3000       4000
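
Two details of `rename` are easy to miss, sketched here on a small hypothetical frame: keys in the mapping that match no column are ignored by default, and a function can be passed instead of a dict to transform every column label.

```python
import pandas as pd

# Hypothetical frame, independent of df_1 above.
df = pd.DataFrame({"a": [1], "b": [2]})

# 'missing' is not a column, so that entry has no effect.
renamed = df.rename(columns={"a": "alpha", "missing": "x"})

# A callable is applied to each column label.
upper = df.rename(columns=str.upper)

print(list(renamed.columns))
print(list(upper.columns))
```

Because unmatched keys are skipped silently, a typo in the mapping leaves the column with its old name, so it is worth checking the resulting columns.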


### Delete a column

In [46]:
del df_1['renamed_4']
df_1

      a     b     c     d  renamed_1  renamed_2  renamed_3
0     1     2     3     4         -1         -2          3
1    10    20    30    40        -10        -20         30
2   100   200   300   400       -100       -200        300
3  1000  2000  3000  4000      -1000      -2000       3000
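
`del` removes one column in place. As a sketch of two alternatives on a small hypothetical frame, `drop` can remove several columns at once and returns a new DataFrame, while `pop` removes a column in place and hands it back as a Series.

```python
import pandas as pd

# Hypothetical frame, independent of df_1 above.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

trimmed = df.drop(columns=["b", "c"])  # new frame; df still has all columns
popped = df.pop("c")                   # removes 'c' from df and returns it

print(list(trimmed.columns))
print(popped.tolist())
print(list(df.columns))
```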
