In [1]:
import pandas as pd 
import numpy as np 

### Pandas is one of the handiest and most useful tools for data analysis.
Advantages:
    1. Offers many more functions than Excel
    2. Has no fixed worksheet size limit like Excel; the data only has to fit in memory
    3. Builds on NumPy, so vectorized operations are fast
    4. Has functions for almost any data manipulation task

Difference between DataFrame and Series -
1. Series - A 1-D labeled array that holds a single row or a single column of a DataFrame. It can store values of any type.
2. DataFrame - A 2-D tabular structure, similar to a spreadsheet, that holds all the rows and columns of an Excel sheet.
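
The relationship between the two structures can be sketched as follows. The frame, its column names, and its values are made up for illustration and are not part of this notebook's data.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
frame = pd.DataFrame({"name": ["Ann", "Bob"], "marks": [50, 70]})

column = frame["marks"]  # selecting one column gives a Series
row = frame.iloc[0]      # selecting one row also gives a Series

print(type(frame).__name__)   # DataFrame
print(type(column).__name__)  # Series
print(type(row).__name__)     # Series
print(list(row.index))        # the column names become the row's index
```

A row pulled out of a DataFrame is indexed by the column names, while a column is indexed by the row labels.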

## Basics of Pandas 

In [2]:
dictionary= {
    "name":['hskjhd','sjdf','jksdfh','hsjdfhj'],
    "marks":[50,70,60,90],
    "school":["shdfj","jhdfjk","shfj","hsldfhj"]
    
}

In [3]:
# pd.DataFrame builds a DataFrame from a dictionary (or from other list-like or tabular input)
df = pd.DataFrame(dictionary)

In [4]:
df

,name,marks,school
0,hskjhd,50,shdfj
1,sjdf,70,jhdfjk
2,jksdfh,60,shfj
3,hsjdfhj,90,hsldfhj


In [5]:
# describe() gives a statistical summary of the numeric columns
df.describe()

,marks
count,4.0
mean,67.5
std,17.078251
min,50.0
25%,57.5
50%,65.0
75%,75.0
max,90.0


In [6]:
df.dtypes

name      object
marks      int64
school    object
dtype: object

In [7]:
df.T

,0,1,2,3
name,hskjhd,sjdf,jksdfh,hsjdfhj
marks,50,70,60,90
school,shdfj,jhdfjk,shfj,hsldfhj


In [8]:
type(df)

pandas.core.frame.DataFrame

In [9]:
type(df['name'])

pandas.core.series.Series

## Accessing and Indexing 

In [10]:
# Attribute-style access: each column is available as an attribute of the DataFrame
df.name

0     hskjhd
1       sjdf
2     jksdfh
3    hsjdfhj
Name: name, dtype: object

In [11]:
df['name']

0     hskjhd
1       sjdf
2     jksdfh
3    hsjdfhj
Name: name, dtype: object

In [12]:
df['name'][2]

'jksdfh'

#### Index-based selection
There are two paradigms for index-based selection:
    1. iloc - selects data based on its numerical position in the data.
    2. loc - selects data based on its label in the data.

In [13]:
df = pd.DataFrame({'a':np.random.rand(10),
                 'b':['London','Paris','New York','Istanbul',
                      'Liverpool','Berlin',np.nan,'Madrid',
                      'Rome',np.nan],
                 'c':[True,True,True,False,False,np.nan,np.nan,
                      False,True,True],
                 'd':[3,4,5,1,5,2,2,np.nan,np.nan,0],
                 'e':[1,4,5,3,3,3,3,8,8,4]})
df

,a,b,c,d,e
0,0.789053,London,True,3.0,1
1,0.421165,Paris,True,4.0,4
2,0.889637,New York,True,5.0,5
3,0.378589,Istanbul,False,1.0,3
4,0.179021,Liverpool,False,5.0,3
5,0.85692,Berlin,,2.0,3
6,0.582409,,,2.0,3
7,0.111936,Madrid,False,,8
8,0.527731,Rome,True,,8
9,0.695706,,True,0.0,4


#### Index based selection - iloc

In [14]:
df.iloc[0]

a    0.789053
b      London
c        True
d           3
e           1
Name: 0, dtype: object

In [15]:
df.iloc[:,0]

0    0.789053
1    0.421165
2    0.889637
3    0.378589
4    0.179021
5    0.856920
6    0.582409
7    0.111936
8    0.527731
9    0.695706
Name: a, dtype: float64

In [16]:
df.iloc[5,1]

'Berlin'

In [17]:
df.iloc[:6,1]

0       London
1        Paris
2     New York
3     Istanbul
4    Liverpool
5       Berlin
Name: b, dtype: object

In [18]:
df.iloc[4,2:]

c    False
d        5
e        3
Name: 4, dtype: object

In [19]:
df.iloc[[0,7,8],:2]

,a,b
0,0.789053,London
7,0.111936,Madrid
8,0.527731,Rome


In [20]:
df.iloc[-10,1]

'London'

#### Label based selection - loc

In [21]:
df.loc[:,'b']

0       London
1        Paris
2     New York
3     Istanbul
4    Liverpool
5       Berlin
6          NaN
7       Madrid
8         Rome
9          NaN
Name: b, dtype: object

In [22]:
df.loc[:4,'b']

0       London
1        Paris
2     New York
3     Istanbul
4    Liverpool
Name: b, dtype: object

In [23]:
df.loc[3:8,'b']

3     Istanbul
4    Liverpool
5       Berlin
6          NaN
7       Madrid
8         Rome
Name: b, dtype: object

#### Choosing between loc and iloc 

iloc uses the Python stdlib indexing scheme, where the first element of the range is included and the last one excluded. So 0:10 will select entries 0,...,9. loc, meanwhile, indexes inclusively. So 0:10 will select entries 0,...,10.

Why the change? Remember that loc can index any stdlib type: strings, for example. If we have a DataFrame with index values Apples, ..., Potatoes, ..., and we want to select "all the alphabetical fruit choices between Apples and Potatoes", then it's a lot more convenient to index df.loc['Apples':'Potatoes'] than it is to index something like df.loc['Apples':'Potatoet'] (t coming after s in the alphabet).

This is particularly confusing when the DataFrame index is a simple numerical list, e.g. 0,...,1000. In this case df.iloc[0:1000] will return 1000 entries, while df.loc[0:1000] returns 1001 of them! To get 1000 elements using loc, you will need to go one lower and ask for df.loc[0:999].

Otherwise, the semantics of using loc are the same as those for iloc.
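
The slicing difference described above can be checked with a minimal sketch. The two Series below (`s` and `stock`) are hypothetical stand-ins, not data from this notebook.

```python
import pandas as pd

# Hypothetical Series with the default integer index 0..1000
s = pd.Series(range(1001))

n_iloc = len(s.iloc[0:1000])  # position-based: the end is excluded
n_loc = len(s.loc[0:1000])    # label-based: the end label is included
print(n_iloc, n_loc)          # 1000 1001

# Hypothetical string-labeled index, sorted alphabetically
stock = pd.Series([3, 7, 2, 9],
                  index=["Apples", "Bananas", "Potatoes", "Tomatoes"])
picked = stock.loc["Apples":"Potatoes"]
print(list(picked.index))     # ['Apples', 'Bananas', 'Potatoes']
```

With the inclusive end label, the slice reads exactly like the request "from Apples to Potatoes", with no need to name the label that would come after Potatoes.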

## Manipulating the Index

In [26]:
df.index

RangeIndex(start=0, stop=10, step=1)

In [28]:
df['Title'] = ["city1","city2","city3","city4","city5","city6","city7","city 8","city9","city10"]

In [29]:
df.head()

,a,b,c,d,e,Title
0,0.789053,London,True,3.0,1,city1
1,0.421165,Paris,True,4.0,4,city2
2,0.889637,New York,True,5.0,5,city3
3,0.378589,Istanbul,False,1.0,3,city4
4,0.179021,Liverpool,False,5.0,3,city5


In [30]:
# set_index returns a new DataFrame; df keeps its RangeIndex unless the result is assigned back
df.set_index("Title")

Title,a,b,c,d,e
city1,0.789053,London,True,3.0,1
city2,0.421165,Paris,True,4.0,4
city3,0.889637,New York,True,5.0,5
city4,0.378589,Istanbul,False,1.0,3
city5,0.179021,Liverpool,False,5.0,3
city6,0.85692,Berlin,,2.0,3
city7,0.582409,,,2.0,3
city 8,0.111936,Madrid,False,,8
city9,0.527731,Rome,True,,8
city10,0.695706,,True,0.0,4
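
As a rough sketch of how the new index behaves, the example below uses a small hypothetical frame (`cities`) in place of the notebook's `df`. It assumes the result of `set_index` is stored in a new variable, since the call does not modify the original frame.

```python
import pandas as pd

# Hypothetical small frame standing in for the notebook's df
cities = pd.DataFrame({"b": ["London", "Paris"],
                       "Title": ["city1", "city2"]})

indexed = cities.set_index("Title")  # new DataFrame; cities is untouched

print(list(cities.columns))       # ['b', 'Title']
print(list(indexed.index))        # ['city1', 'city2']
print(indexed.loc["city2", "b"])  # Paris

# reset_index moves Title back into an ordinary column
restored = indexed.reset_index()
print(list(restored.columns))     # ['Title', 'b']
```

Once the labels are strings, `loc` selects rows by those strings, while `iloc` still selects by position.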


## Conditional selection 

In [34]:
df.b == 'Madrid'

0    False
1    False
2    False
3    False
4    False
5    False
6    False
7     True
8    False
9    False
Name: b, dtype: bool

In [36]:
df.e == 3

0    False
1    False
2    False
3     True
4     True
5     True
6     True
7    False
8    False
9    False
Name: e, dtype: bool

In [37]:
df.loc[df.e == 3]

,a,b,c,d,e,Title
3,0.378589,Istanbul,False,1.0,3,city4
4,0.179021,Liverpool,False,5.0,3,city5
5,0.85692,Berlin,,2.0,3,city6
6,0.582409,,,2.0,3,city7


In [49]:
df.loc[df.b.isin(['Berlin', 'Liverpool'])]

,a,b,c,d,e,Title
4,0.179021,Liverpool,False,5.0,3,city5
5,0.85692,Berlin,,2.0,3,city6


In [51]:
df.isnull().sum()

a        0
b        2
c        2
d        2
e        0
Title    0
dtype: int64

In [54]:
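# Note: this does NOT select the rows with missing values. The per-column null counts
# [0, 2, 2, 2, 0, 0] are used as row labels, so rows 0, 2, 2, 2, 0, 0 are returned.
df.loc[df.isnull().sum()]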

,a,b,c,d,e,Title
0,0.789053,London,True,3.0,1,city1
2,0.889637,New York,True,5.0,5,city3
2,0.889637,New York,True,5.0,5,city3
2,0.889637,New York,True,5.0,5,city3
0,0.789053,London,True,3.0,1,city1
0,0.789053,London,True,3.0,1,city1


In [55]:
# Likewise, df.c.isnull().sum() is the scalar 2, so this returns the row labeled 2
df.loc[df.c.isnull().sum()]

a        0.889637
b        New York
c            True
d               5
e               5
Title       city3
Name: 2, dtype: object
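
If the goal is to pick out the rows that actually contain missing values, a boolean mask is one way to do it. This is a minimal sketch on a hypothetical frame (`sample`), not the notebook's `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing values
sample = pd.DataFrame({"b": ["London", np.nan, "Rome"],
                       "d": [3.0, 2.0, np.nan]})

# One True/False per row: True if any column in that row is null
has_missing = sample.isnull().any(axis=1)

rows_with_nan = sample.loc[has_missing]         # rows with at least one NaN
rows_missing_b = sample.loc[sample.b.isnull()]  # rows where column b is NaN

print(list(rows_with_nan.index))   # [1, 2]
print(list(rows_missing_b.index))  # [1]
```

The mask has one entry per row, which is what `loc` needs for conditional selection; `isnull().sum()` instead has one entry per column.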

### Some Basic Functions

In [56]:
# describe() - summary statistics: count, mean, std, min, quartiles (the 50% row is the median) and max
df.e.describe()

count    10.000000
mean      4.200000
std       2.250926
min       1.000000
25%       3.000000
50%       3.500000
75%       4.750000
max       8.000000
Name: e, dtype: float64

In [59]:
# unique() - returns the distinct values in the column (NaN included)
df.b.unique()

array(['London', 'Paris', 'New York', 'Istanbul', 'Liverpool', 'Berlin',
       nan, 'Madrid', 'Rome'], dtype=object)

In [61]:
# value_counts() - lists the unique values and how often each occurs in the column
df.e.value_counts()

3    4
8    2
4    2
5    1
1    1
Name: e, dtype: int64

#### Maps
map() takes a set of values and maps each one to a new value, producing a different set of values.


In [63]:
df.e

0    1
1    4
2    5
3    3
4    3
5    3
6    3
7    8
8    8
9    4
Name: e, dtype: int64

In [62]:
df_e_mean = df.e.mean()
df.e.map(lambda p: p - df_e_mean)

0   -3.2
1   -0.2
2    0.8
3   -1.2
4   -1.2
5   -1.2
6   -1.2
7    3.8
8    3.8
9   -0.2
Name: e, dtype: float64

The function you pass to map() should expect a single value from the Series (one entry of column e, in the above example)
and return a transformed version of that value. map() returns a new Series in which every value has been transformed
by your function. It does not modify the original data.
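
A short sketch of this behavior, using a hypothetical Series (`scores`) rather than `df.e`. The dictionary form of `map()` in the second call is an addition not used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical Series standing in for df.e
scores = pd.Series([1, 4, 5, 3])
scores_mean = scores.mean()  # 3.25

# Function form: called once per value
centered = scores.map(lambda p: p - scores_mean)

# Dictionary form: values without a matching key become NaN
labels = scores.map({1: "low", 3: "mid", 4: "high"})

print(centered.tolist())  # [-2.25, 0.75, 1.75, -0.25]
print(labels.tolist())
print(scores.tolist())    # [1, 4, 5, 3], the original is unchanged
```

To keep the transformed values, assign the result back, for example to a new column.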
