# Pandas:
Pandas is a Python library that makes data science and data analysis easy and effective.

The name "Pandas" is derived from "panel data", an econometrics term for multidimensional, structured data sets.

## Features of Pandas
1. It can read or write in many different data formats. 

2. Columns from a Pandas data structure can be deleted or inserted

3. It offers good input and output capabilities; for example, it can pull data from a MySQL database directly into a DataFrame

4. It has functionality to find and fill missing data

5. It supports reshaping of data into different forms.
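
As a small illustration of the first and third features, the sketch below reads CSV text from an in-memory buffer and writes it back out as JSON. The sample data and the names csv_text, df_io and json_text are made up for this example; in practice a file path or database connection would take the place of the buffer.

```python
import pandas as pd
from io import StringIO

# Hypothetical CSV content held in memory instead of a file on disk
csv_text = "Name,Marks\nRinku,67\nRitu,65\n"

# read_csv accepts any file-like object, so StringIO stands in for a file
df_io = pd.read_csv(StringIO(csv_text))
print(df_io)

# Write the same data out in a different format (JSON, one object per row)
json_text = df_io.to_json(orient="records")
print(json_text)
```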

In [1]:
import numpy as np
import pandas as pd
from io import StringIO, BytesIO

# Series
A Series is a one-dimensional structure that stores homogeneous data. Its values are mutable, but its size is immutable.

It contains a sequence of values and an associated array of data labels called its index. By default, the index consists of integer labels starting from zero.

## Creation of Series

In [2]:
# creating an empty Series (an explicit dtype avoids the default-dtype warning in older pandas)
s = pd.Series(dtype='float64')
print(s)

Series([], dtype: float64)




### creating series from list

In [4]:
# creating Series by using list
l = [2,4,6,2,4,65,24,67]
s = pd.Series(l, index=[i for i in 'ABCDEFGH'])
print(s)

A     2
B     4
C     6
D     2
E     4
F    65
G    24
H    67
dtype: int64


In [2]:
# creating Series using two different lists
l1 = ['Jan','Feb','Mar','Apr','May']
l2 = [31,29,31,30,31]
s = pd.Series(l2, index=l1)
print(s)

Jan    31
Feb    29
Mar    31
Apr    30
May    31
dtype: int64


In [5]:
# creating Series using range()
s = pd.Series(range(4,9,2))
print(s)

0    4
1    6
2    8
dtype: int64


In [9]:
# creating Series using range() for both the values and the index
s = pd.Series(range(1,11), index=range(10,20))
print(s)

10     1
11     2
12     3
13     4
14     5
15     6
16     7
17     8
18     9
19    10
dtype: int64


In [11]:
# Handling floating-point values in a Series: mixing ints and floats upcasts everything to float64
s= pd.Series([1,43,23,1.0,4])
print(s)

0     1.0
1    43.0
2    23.0
3     1.0
4     4.0
dtype: float64


In [14]:
# Creating a Series with a missing value (NaN); np.nan is the spelling that works in all NumPy versions
s = pd.Series([2,5,9,np.nan,232])
print(s)

0      2.0
1      5.0
2      9.0
3      NaN
4    232.0
dtype: float64


### Creating Series from Scalar or constant values

In [7]:
# creating a series from a numerical constant
s =pd.Series(55, index=[100,101,102,103,104])
print(s)

100    55
101    55
102    55
103    55
104    55
dtype: int64


In [8]:
# Series from numerical constant with indexing with range()
s = pd.Series(55, index=range(2,10,2))
print(s)

2    55
4    55
6    55
8    55
dtype: int64


In [16]:
# creating Series using string both as index and constant value
s = pd.Series("Welcome to LPU" , index=['Jaswanth','Yaswanth','Pavan','Praney'])
print(s)

Jaswanth    Welcome to LPU
Yaswanth    Welcome to LPU
Pavan       Welcome to LPU
Praney      Welcome to LPU
dtype: object


### Creating a series from Dictionary
A dictionary can be passed as input and, if no index is specified, the dictionary keys are used to construct the index. Recent pandas versions keep the keys in insertion order, as the output below shows; older versions sorted them.

In [3]:
# creating series from dictionary
s =pd.Series({'Jaswanth':18, 'Yaswanth':19, 'Pavan':17, 'Praney':18})
print(s)

Jaswanth    18
Yaswanth    19
Pavan       17
Praney      18
dtype: int64
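
When an explicit index is passed along with a dictionary, only the matching keys are used. The sketch below illustrates this with a shortened version of the dictionary above; the names ages and s_idx are chosen just for this example.

```python
import pandas as pd

ages = {'Jaswanth': 18, 'Yaswanth': 19, 'Pavan': 17}

# Only keys listed in `index` are kept, in the order given;
# 'Praney' is not a key in the dictionary, so its value becomes NaN
s_idx = pd.Series(ages, index=['Pavan', 'Praney', 'Jaswanth'])
print(s_idx)
```

Because NaN is a floating-point value, the resulting Series is stored as float64 rather than int64.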


In [4]:
# naming a Series and its index
s =pd.Series({'Jaswanth':18, 'Yaswanth':19, 'Pavan':17, 'Praney':18})
s.name= 'Names'
s.index.name = 'First Name'
print(s)

First Name
Jaswanth    18
Yaswanth    19
Pavan       17
Praney      18
Name: Names, dtype: int64


### Creating a series using a Mathematical Expression/Function
A Series object can be created by using a mathematical expression or function that determines the values of the data sequence, using the following syntax:

< series object > = pd.Series(index=None, data=< expression or function >)

In [2]:
# Generate a series using a mathematical expression
n1 = np.arange(10,15)
print(n1)
s = pd.Series( data= n1*8, index=n1)
print(s)

[10 11 12 13 14]
10     80
11     88
12     96
13    104
14    112
dtype: int64


In [19]:
#Generate a series using a mathematical function( exponentiation )
n1 = np.arange(10,15)
print(n1)
s = pd.Series(index=n1, data= n1**2)
print(s)

[10 11 12 13 14]
10    100
11    121
12    144
13    169
14    196
dtype: int64


### Creating Series using NumPy ndarray

In [21]:
# generate a series using a one-dimensional array
arr = np.arange(10,14)
s = pd.Series(arr, index = ['a','b','c','d'])
print(s)

a    10
b    11
c    12
d    13
dtype: int64


## Accessing the data from series

In [5]:
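s = pd.Series([1,2,3,4,5,6], index=[x for x in 'abcdef'])
print('s.iloc[0] = ',s.iloc[0]) # explicit positional access; plain s[0] on a labelled index is deprecated in pandas 2.x
print('s[:3] = \n',s[:3])
print('s[-3:] = \n',s[-3:])

s.iloc[0] =  1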
s[:3] = 
 a    1
b    2
c    3
dtype: int64
s[-3:] = 
 d    4
e    5
f    6
dtype: int64


### iloc[] and loc[] indexing function
1. iloc[] -> used for indexing or selecting based on integer position. A slice does not include its end position.
2. loc[] -> used for indexing or selecting based on labels. A slice includes its end label.

In [6]:
s = pd.Series([1,2,3,4,5,6], index=[x for x in 'abcdef'])
print(s.iloc[1:4]) # positional slice with iloc[]: end position excluded
print(s.loc['b':'e']) # label slice with loc[]: end label included

b    2
c    3
d    4
dtype: int64
b    2
c    3
d    4
e    5
dtype: int64
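
The difference between the two matters most when the labels are themselves integers. The following sketch uses a made-up Series, s_int, whose labels run in the opposite direction to its positions; the variable names are illustrative only.

```python
import pandas as pd

# Integer labels that do not match the positions 0, 1, 2, ...
s_int = pd.Series([10, 20, 30, 40], index=[3, 2, 1, 0])

by_label = s_int.loc[0]       # label 0 belongs to the last element
by_position = s_int.iloc[0]   # position 0 is the first element
print(by_label, by_position)

# Slices: iloc excludes the end position, loc includes the end label
print(s_int.iloc[0:2])   # positions 0 and 1
print(s_int.loc[3:1])    # labels 3, 2 and 1
```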


## Series object attributes
syntax: < series object >.< attribute >

In [6]:
s = pd.Series(range(1,15,3), index=[x for x in 'abcde'])
print('len(s) :',len(s)) # len() gives the length of the series
print('s.index :',s.index) # returns the index of the series
print('s.values :',s.values) # returns the data as an ndarray
print('s.dtype :',s.dtype) # returns the dtype object of the underlying data
print('s.shape :',s.shape) # returns a tuple with the shape of the underlying data
print('s.nbytes :',s.nbytes) # returns the number of bytes in the underlying data
print('s.ndim :',s.ndim) # returns the number of dimensions
print('s.size :',s.size) # returns the number of elements
#print('s.itemsize :', s.itemsize) # size of the dtype; removed from Series in pandas 1.0
print('s.hasnans :',s.hasnans) # returns True if there are any NaN values
print('s.empty:',s.empty) # returns True if the series object is empty
print('s.head(number) :',s.head(2)) # method: returns the first `number` elements
print('s.tail(number) :',s.tail(2)) # method: returns the last `number` elements

len(s) : 5
s.index : Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
s.values : [ 1  4  7 10 13]
s.dtype : int64
s.shape : (5,)
s.nbytes : 40
s.ndim : 1
s.size : 5
s.hasnans : False
s.empty: False
s.head(number) : a    1
b    4
dtype: int64
s.tail(number) : d    10
e    13
dtype: int64


## Mathematical Operations on series
Mathematical processing can be performed on a Series using scalar values and functions. All the arithmetic operators, such as +, -, * and /, work on Series. Operations between two Series are aligned on index labels, so both Series should share the same index; a label present in only one of them produces NaN in the result.

In [4]:
s = pd.Series(range(1,10))
s1 = pd.Series(range(11,20))
print('s+s1')
print(s+s1)

s+s1
0    12
1    14
2    16
3    18
4    20
5    22
6    24
7    26
8    28
dtype: int64


In [5]:
print('s-s1')
print(s-s1)

s-s1
0   -10
1   -10
2   -10
3   -10
4   -10
5   -10
6   -10
7   -10
8   -10
dtype: int64


In [6]:
print('s*s1')
print(s*s1)

s*s1
0     11
1     24
2     39
3     56
4     75
5     96
6    119
7    144
8    171
dtype: int64


In [7]:
print('s1/s')
print(s1/s)

s1/s
0    11.000000
1     6.000000
2     4.333333
3     3.500000
4     3.000000
5     2.666667
6     2.428571
7     2.250000
8     2.111111
dtype: float64
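
The examples above use two Series with identical indexes. The sketch below shows what happens when the indexes only partly overlap, and how the add() method with fill_value can avoid the resulting NaN values. The Series a and b are made up for illustration.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['y', 'z', 'w'])

# Operators align on labels; 'x' and 'w' exist in only one operand -> NaN
plain_sum = a + b
print(plain_sum)

# add() with fill_value treats a missing label as 0 instead
filled_sum = a.add(b, fill_value=0)
print(filled_sum)
```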


##  Vector operations on Series
Series also supports vector operations. Any operation to be performed on a series gets performed on every single element of it.

In [9]:
s = pd.Series(range(1,6))
print('s+2 : \n',s+2)
print('s>3 : \n',s>3)

s+2 : 
 0    3
1    4
2    5
3    6
4    7
dtype: int64
s>3 : 
 0    False
1    False
2    False
3     True
4     True
dtype: bool


## Retrieving values using conditions

In [10]:
s = pd.Series([1.00000,1.414214,1.732051,2.000000])
print('s : \n',s)
print('s<2 : \n',s<2)
print('s[s<2]: \n',s[s<2])

s : 
 0    1.000000
1    1.414214
2    1.732051
3    2.000000
dtype: float64
s<2 : 
 0     True
1     True
2     True
3    False
dtype: bool
s[s<2]: 
 0    1.000000
1    1.414214
2    1.732051
dtype: float64


## Deleting an element in Series
We can delete an element from a Series using the drop() method, passing the index label of the element to be deleted as the argument. drop() returns a new Series and leaves the original unchanged.

In [34]:
s = pd.Series(range(100,105))
print('s : \n',s)

d = s.drop(3)  # drop() returns a new Series; s itself is unchanged
print('Series after dropping index 3 : \n',d)
print('s : \n',s)

s : 
 0    100
1    101
2    102
3    103
4    104
dtype: int64
Series after dropping index 3 : 
 0    100
1    101
2    102
4    104
dtype: int64
s : 
 0    100
1    101
2    102
3    103
4    104
dtype: int64
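
As a further sketch, drop() also accepts a list of labels, and the result can be assigned back to the same name when the shortened Series should replace the original. The names s_drop and trimmed are illustrative.

```python
import pandas as pd

s_drop = pd.Series(range(100, 105))

# drop() returns a new Series; a list of labels removes several at once
trimmed = s_drop.drop([0, 3])
print(trimmed)

# Reassigning the name is one way to keep the result
s_drop = s_drop.drop(4)
print(s_drop)
```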


# Data Frame
A DataFrame is a two-dimensional data structure that stores heterogeneous, mutable data in rows and columns, just like a table.

## Features of dataframe are:
1. Columns can be of different data types
2. The size of a dataframe is mutable
3. Its data is also mutable
4. Labelled axes - rows/columns
5. Arithmetic operations can be performed on rows and columns
6. Indexes may consist of numbers, strings, or letters

##  Creation of DataFrame and display

In [7]:
df = pd.DataFrame() #empty dataframe
print(df)

Empty DataFrame
Columns: []
Index: []


DataFrames can be created from the following constructs:
1. Lists
2. Series
3. Dictionary
4. numpy ndarrays

### Creating DataFrame from Lists

In [20]:
# creating a dataframe from list
l1 = [10,20,30,40,50]
df = pd.DataFrame(l1,columns=['Numbers'])
print('df:')
print(df)

df:
   Numbers
0       10
1       20
2       30
3       40
4       50


In [19]:
# creating from nested list
l1 = [['shreya',20],['raj',30],['srijan',50]]
df = pd.DataFrame(l1, columns=['Name','Marks'])
print(df)

     Name  Marks
0  shreya     20
1     raj     30
2  srijan     50


### Creating DataFrame from Series

In [21]:
# creating dataframe from two series of student data
student_marks = pd.Series({'vijaya':80,'Rajul':92,'Meghna':67,'Radhika':95})
student_age = pd.Series({'vijaya':32,'Rajul': 28,'Meghna':30,'Radhika':20})
df = pd.DataFrame({'Marks':student_marks,'Age':student_age})
print(df)

         Marks  Age
vijaya      80   32
Rajul       92   28
Meghna      67   30
Radhika     95   20


### Creating DataFrame from Dictionary
The different ways to create a dataframe using dictionary are:
1. Dictionary of list
2. Dictionary of series
3. List of dictionaries

#### Dictionary of lists

In [8]:
stu = {'Name':['Rinku','Ritu','Ajay','Pankaj'],
       'English':[67,65,98,45],
       'Economics':[78,59,98,65],
       'IP':[98,97,95,96],
       'Accouncts':[77,70,80,57]}
df = pd.DataFrame(stu)
print(df)

     Name  English  Economics  IP  Accouncts
0   Rinku       67         78  98         77
1    Ritu       65         59  97         70
2    Ajay       98         98  95         80
3  Pankaj       45         65  96         57


In [9]:
# to save the file in 'student.csv'
df.to_csv('student.csv')

#### Dataframe from dictionary of series

In [10]:
n = pd.Series(['Rinku','Ritu','Ajay','Pankaj'])
Eng=pd.Series([67,65,98,45])
Eco = pd.Series([78,59,98,65])
IP = pd.Series([98,97,95,96])
Acc = pd.Series([77,70,80,57])
# creating a dictionary of series
stu = {'Name':n,'English':Eng,'Economics':Eco,'IP':IP,'Accouncts':Acc}
df = pd.DataFrame(stu)
print(df)

     Name  English  Economics  IP  Accouncts
0   Rinku       67         78  98         77
1    Ritu       65         59  97         70
2    Ajay       98         98  95         80
3  Pankaj       45         65  96         57


#### List of Dictionaries

In [11]:
l1 = [{'Name':'Rinku','English':67,'Eco':78,'IP':98,'Acc':77},
      {'Name':'Ritu','English':65,'Eco':59,'IP':97,'Acc':70}]
df = pd.DataFrame(l1)
print(df)

    Name  English  Eco  IP  Acc
0  Rinku       67   78  98   77
1   Ritu       65   59  97   70


### Creating DataFrame using Numpy ndarray

In [12]:
arr = np.array([[67,78,75,48],[65,98,14,48],[87,14,91,38],[73,79,71,72]])
col_names = ['English','Economics','IP','Accounts']
df = pd.DataFrame(arr , columns = col_names)
print(df)

   English  Economics  IP  Accounts
0       67         78  75        48
1       65         98  14        48
2       87         14  91        38
3       73         79  71        72


## Sorting Data in DataFrames
syntax: < dataframe name >.sort_values(by=< column(s) >, ascending=True). The two most commonly used arguments are by, the column or list of columns to sort on, and ascending, the order of sorting (True by default).

In [29]:
student_marks = pd.Series({'vijaya':80,'Rajul':92,'Meghna':67,'Radhika':95})
student_age = pd.Series({'vijaya':32,'Rajul': 28,'Meghna':30,'Radhika':20})
df = pd.DataFrame({'Marks':student_marks,'Age':student_age})
print(df)
# print('sorting :')
# print(df.sort_values)
print('Sorting on basis of marks:')
print(df.sort_values(by=['Marks']))

print('Sorting in descending order of marks')
print(df.sort_values(by=['Marks'],ascending = False))

         Marks  Age
vijaya      80   32
Rajul       92   28
Meghna      67   30
Radhika     95   20
Sorting on basis of marks:
         Marks  Age
Meghna      67   30
vijaya      80   32
Rajul       92   28
Radhika     95   20
Sorting in descending order of marks
         Marks  Age
Radhika     95   20
Rajul       92   28
vijaya      80   32
Meghna      67   30
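
Sorting can also use more than one column. The sketch below is a variation on the student data in which two students share the same marks (the values are changed for illustration), so that the second sort key has a visible effect. The names df_sort and ranked are assumptions of this example.

```python
import pandas as pd

df_sort = pd.DataFrame(
    {'Marks': [80, 92, 80, 95], 'Age': [32, 28, 30, 20]},
    index=['vijaya', 'Rajul', 'Meghna', 'Radhika'],
)

# Sort by Marks (highest first); ties are broken by Age (lowest first)
ranked = df_sort.sort_values(by=['Marks', 'Age'], ascending=[False, True])
print(ranked)
```

Passing a list to ascending gives each sort column its own direction.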


## Attributes of DataFrame
syntax: < DataFrameObject >.< attribute_name >

In [5]:
# named `data` rather than `dict` so the built-in dict type is not shadowed
data = {'2018':[85.4,88.2,80.3,79.0],'2019':[77.9,80.5,78.6,76.2],'2020':[86.5,90.0,77.5,80.5]}
df = pd.DataFrame(data , index=['Accountancy','IP','Economics','English'])
print('df: \n',df)

# index: used to fetch the index's names as the index could be 0,1,2,3...
print('\n df.index :',df.index)

# columns : used to fetch the column's name
print('\n df.columns :',df.columns)

# axes : used to fetch both index and column names
print('\n df.axes :',df.axes)

# dtypes : fetch the datatype values of the items in the dataframe
print('\n df.dtypes :',df.dtypes)

# size : fetch the size of the dataframe
print( '\n df.size :',df.size)

#shape : fetch the no.of rows and no.of columns
print('\n df.shape :',df.shape)

# ndim : fetch the dimension of the given dataframe
print('\n df.ndim:',df.ndim)

# empty : gives boolean o/p i.e True or False
print('\n df.empty :',df.empty)

# isna() : checks the presence of NaN(not a number)
print('\n df.isna() :',df.isna())

df: 
              2018  2019  2020
Accountancy  85.4  77.9  86.5
IP           88.2  80.5  90.0
Economics    80.3  78.6  77.5
English      79.0  76.2  80.5

 df.index : Index(['Accountancy', 'IP', 'Economics', 'English'], dtype='object')

 df.columns : Index(['2018', '2019', '2020'], dtype='object')

 df.axes : [Index(['Accountancy', 'IP', 'Economics', 'English'], dtype='object'), Index(['2018', '2019', '2020'], dtype='object')]

 df.dtypes : 2018    float64
2019    float64
2020    float64
dtype: object

 df.size : 12

 df.shape : (4, 3)

 df.ndim: 2

 df.empty : False

 df.isna() :               2018   2019   2020
Accountancy  False  False  False
IP           False  False  False
Economics    False  False  False
English      False  False  False


In [11]:
# count() : gives the number of non-NaN values in the dataframe
# by default (axis='index'), it counts down the rows and returns one count per column

print('df.count() :')
print(df.count())

print('\ndf.count(axis = index )')
print(df.count(axis='index'))

print('\ndf.count(number)')
print(df.count(1))

print('\ndf.count(axis = columns )')
print(df.count(axis='columns'))

df.count() :
2018    4
2019    4
2020    4
dtype: int64

df.count(axis = index )
2018    4
2019    4
2020    4
dtype: int64

df.count(number)
Accountancy    3
IP             3
Economics      3
English        3
dtype: int64

df.count(axis = columns )
Accountancy    3
IP             3
Economics      3
English        3
dtype: int64
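
Since the dataframe above has no missing values, every count equals the number of rows or columns. The sketch below uses a smaller, made-up dataframe with some NaN entries to show that count() only counts the values that are present. The names df_cnt, per_column and per_row are illustrative.

```python
import numpy as np
import pandas as pd

df_cnt = pd.DataFrame({'2018': [85.4, np.nan, 80.3],
                       '2019': [77.9, 80.5, np.nan],
                       '2020': [86.5, 90.0, 77.5]})

# count() counts non-missing cells: one count per column by default ...
per_column = df_cnt.count()
# ... and one count per row with axis='columns'
per_row = df_cnt.count(axis='columns')
print(per_column)
print(per_row)
```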


In [12]:
# Transpose- T
# it makes transpose of the dataframe
data = {'2018':[85.4,88.2,80.3,79.0],'2019':[77.9,80.5,78.6,76.2],'2020':[86.5,90.0,77.5,80.5]}
df = pd.DataFrame(data , index=['Accountancy','IP','Economics','English'])
print('df :')
print(df)

print('df.T')
print(df.T)

df :
             2018  2019  2020
Accountancy  85.4  77.9  86.5
IP           88.2  80.5  90.0
Economics    80.3  78.6  77.5
English      79.0  76.2  80.5
df.T
      Accountancy    IP  Economics  English
2018         85.4  88.2       80.3     79.0
2019         77.9  80.5       78.6     76.2
2020         86.5  90.0       77.5     80.5


## Selecting, Accessing , Modifying Records from DataFrame
Records can be selected by slicing, which returns the rows that fall within the range given for the dataframe object.

In [11]:
# indexed dataframe using lists
stu = {'Name':['Rinku','Ritu','Ajay','Pankaj'],
       'English':[67,65,98,45],
       'Economics':[78,59,98,65],
       'IP':[98,97,95,96],
       'Accouncts':[77,70,80,57]}
df = pd.DataFrame(stu, index = ['Sno1','Sno2','Sno3','Sno4'])

print(df)

        Name  English  Economics  IP  Accouncts
Sno1   Rinku       67         78  98         77
Sno2    Ritu       65         59  97         70
Sno3    Ajay       98         98  95         80
Sno4  Pankaj       45         65  96         57


In [12]:
# To change the index columns
df.set_index('Name',inplace=True)
print(df)

        English  Economics  IP  Accouncts
Name                                     
Rinku        67         78  98         77
Ritu         65         59  97         70
Ajay         98         98  95         80
Pankaj       45         65  96         57


In [13]:
# To reset the index columns
df.reset_index(inplace= True)
print(df)

     Name  English  Economics  IP  Accouncts
0   Rinku       67         78  98         77
1    Ritu       65         59  97         70
2    Ajay       98         98  95         80
3  Pankaj       45         65  96         57


## Adding/ Modifying a row in DataFrame
A row is added or modified by using DataFrame.loc[ ].

1. If we try to add a row with fewer values than the number of columns in the dataframe, it raises a ValueError with the message:
   * ValueError: cannot set a row with mismatched columns
2. If we try to add a column with fewer values than the number of rows in the dataframe, it raises a ValueError with the message:
   * ValueError: Length of values does not match length of index

In [2]:
stu = {'Name':['Rinku','Ritu','Ajay','Pankaj'],
       'English':[67,65,98,45],
       'Economics':[78,59,98,65],
       'IP':[98,97,95,96],
       'Accouncts':[77,70,80,57]}
df = pd.DataFrame(stu)
print('Before modifying the rows')
print(df)
df.loc['5'] = ['Jaswanth',99,98,97,99]
print('After modifying the DataFrame')
print(df)

Before modifying the rows
     Name  English  Economics  IP  Accouncts
0   Rinku       67         78  98         77
1    Ritu       65         59  97         70
2    Ajay       98         98  95         80
3  Pankaj       45         65  96         57
After modifying the DataFrame
       Name  English  Economics  IP  Accouncts
0     Rinku       67         78  98         77
1      Ritu       65         59  97         70
2      Ajay       98         98  95         80
3    Pankaj       45         65  96         57
5  Jaswanth       99         98  97         99
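
The two error cases listed above can be reproduced with a short sketch. The dataframe df_rows and the list errors are made up for this example, and the exact wording of the messages may differ slightly between pandas versions.

```python
import pandas as pd

df_rows = pd.DataFrame({'Name': ['Rinku', 'Ritu'],
                        'English': [67, 65],
                        'IP': [98, 97]})

# Three values for three columns: the row is added under the new label 2
df_rows.loc[2] = ['Ajay', 98, 95]

errors = []
try:
    df_rows.loc[3] = ['Pankaj', 45]        # only two values for three columns
except ValueError as exc:
    errors.append(str(exc))

try:
    df_rows['Accounts'] = [77, 70]         # only two values for three rows
except ValueError as exc:
    errors.append(str(exc))

for message in errors:
    print('ValueError:', message)
```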


## Renaming Column Name in DataFrame
syntax: df.rename( columns = d, inplace = True ), where d is a dictionary that maps old column names to new ones

In [21]:
stu = {'Name':['Rinku','Ritu','Ajay','Pankaj'],
       'English':[67,65,98,45],
       'Economics':[78,59,98,65],
       'IP':[98,97,95,96],
       'Accouncts':[77,70,80,57]}
df = pd.DataFrame(stu)
print('Before modifying the rows')
print(df)
df.rename(columns={'Name':'Names','English':'Eng','Economics':'Eco','Accouncts':'Acc'},inplace=True)
print('After modifying the DataFrame')
print(df)

Before modifying the rows
     Name  English  Economics  IP  Accouncts
0   Rinku       67         78  98         77
1    Ritu       65         59  97         70
2    Ajay       98         98  95         80
3  Pankaj       45         65  96         57
After modifying the DataFrame
    Names  Eng  Eco  IP  Acc
0   Rinku   67   78  98   77
1    Ritu   65   59  97   70
2    Ajay   98   98  95   80
3  Pankaj   45   65  96   57


## Adding Columns to a DataFrame
When adding a new column to an existing dataframe from a list, the number of values in the new column must equal the length of the index.
* syntax : < dfobject >[< new_col_name >] = < new value(s) >

In [23]:
stu = {'Name':['Rinku','Ritu','Ajay','Pankaj'],
       'English':[67,65,98,45],
       'Economics':[78,59,98,65],
       'IP':[98,97,95,96],
       'Accouncts':[77,70,80,57]}
df = pd.DataFrame(stu)
print('Before modifying the rows')
print(df)
# adding the col
df['Total'] = df['English']+df['Economics']+df['IP']+df['Accouncts'] 
df['Avg']=df['Total']/4
print('After modifying the DataFrame')
print(df)

Before modifying the rows
     Name  English  Economics  IP  Accouncts
0   Rinku       67         78  98         77
1    Ritu       65         59  97         70
2    Ajay       98         98  95         80
3  Pankaj       45         65  96         57
After modifying the DataFrame
     Name  English  Economics  IP  Accouncts  Total    Avg
0   Rinku       67         78  98         77    320  80.00
1    Ritu       65         59  97         70    291  72.75
2    Ajay       98         98  95         80    371  92.75
3  Pankaj       45         65  96         57    263  65.75


### using insert() method
By using the insert() method, we can add a new column to an existing dataframe at any column position.
* syntax: < dfobject>.insert(n, < new_col_name>, [data]), where n is the column index at which the new column is to be inserted

In [13]:
stu = {'Name':['Rinku','Ritu','Ajay','Pankaj'],
       'English':[67,65,98,45],
       'Economics':[78,59,98,65],
       'IP':[98,97,95,96],
       'Accouncts':[77,70,80,57]}
df = pd.DataFrame(stu)
print('Before modifying the rows')
print(df)
df.insert(4,'Qualify',['NO','Yes','NO','Yes'])
print('After modifying the DataFrame')
print(df)

Before modifying the rows
     Name  English  Economics  IP  Accouncts
0   Rinku       67         78  98         77
1    Ritu       65         59  97         70
2    Ajay       98         98  95         80
3  Pankaj       45         65  96         57
After modifying the DataFrame
     Name  English  Economics  IP Qualify  Accouncts
0   Rinku       67         78  98      NO         77
1    Ritu       65         59  97     Yes         70
2    Ajay       98         98  95      NO         80
3  Pankaj       45         65  96     Yes         57


## Selecting a column from a dataframe
Pandas provides three ways to access the columns of a dataframe:
1. df-object['column_name']
2. df-object.column_name
3. Selecting or accessing rows/columns using loc[ ] and iloc[ ]
    * loc[ ] : label-based indexing
    * iloc[ ] : integer position-based indexing

###  By using df-object['Column_name']

In [3]:
stu = {'Name':['Rinku','Ritu','Ajay','Pankaj'],
       'English':[67,65,98,45],
       'Economics':[78,59,98,65],
       'IP':[98,97,95,96],
       'Accouncts':[77,70,80,57]}
df = pd.DataFrame(stu)
print(df)
print('df["IP"]')
print(df['IP'])

     Name  English  Economics  IP  Accouncts
0   Rinku       67         78  98         77
1    Ritu       65         59  97         70
2    Ajay       98         98  95         80
3  Pankaj       45         65  96         57
df["IP"]
0    98
1    97
2    95
3    96
Name: IP, dtype: int64


### By using df-object.Column_name

In [5]:
stu = {'Name':['Rinku','Ritu','Ajay','Pankaj'],
       'English':[67,65,98,45],
       'Economics':[78,59,98,65],
       'IP':[98,97,95,96],
       'Accouncts':[77,70,80,57]}
df = pd.DataFrame(stu)
print(df)
print('df.IP')
print(df.IP)

     Name  English  Economics  IP  Accouncts
0   Rinku       67         78  98         77
1    Ritu       65         59  97         70
2    Ajay       98         98  95         80
3  Pankaj       45         65  96         57
df.IP
0    98
1    97
2    95
3    96
Name: IP, dtype: int64


### By using loc[ ]  and iloc[ ]

In [2]:
stu = {'Name':['Rinku','Ritu','Ajay','Pankaj'],
       'English':[67,65,98,45],
       'Economics':[78,59,98,65],
       'IP':[98,97,95,96],
       'Accouncts':[77,70,80,57]}
df = pd.DataFrame(stu)
print(df)
# df.iloc[row-indexes , column-indexes]
print('\ndf.iloc[:,[1,2]]')
print(df.iloc[:,[1,2]])

     Name  English  Economics  IP  Accouncts
0   Rinku       67         78  98         77
1    Ritu       65         59  97         70
2    Ajay       98         98  95         80
3  Pankaj       45         65  96         57

df.iloc[:,[1,2]]
   English  Economics
0       67         78
1       65         59
2       98         98
3       45         65


In [3]:
stu = {'Name':['Rinku','Ritu','Ajay','Pankaj'],
       'English':[67,65,98,45],
       'Economics':[78,59,98,65],
       'IP':[98,97,95,96],
       'Accouncts':[77,70,80,57]}
df = pd.DataFrame(stu)
print(df)

# df.loc[row-labels] : the label slice 1:3 includes the end label 3
print('\ndf.loc[1:3]')
print(df.loc[1:3])

     Name  English  Economics  IP  Accouncts
0   Rinku       67         78  98         77
1    Ritu       65         59  97         70
2    Ajay       98         98  95         80
3  Pankaj       45         65  96         57

df.loc[1:3]
     Name  English  Economics  IP  Accouncts
1    Ritu       65         59  97         70
2    Ajay       98         98  95         80
3  Pankaj       45         65  96         57
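
Both indexers also accept a row selector and a column selector together. The sketch below, on a reduced version of the student data, selects the same block once by label and once by position, and then uses a boolean condition inside loc[ ]. The names df_sel, by_label, by_position and high_english are illustrative.

```python
import pandas as pd

df_sel = pd.DataFrame({'Name': ['Rinku', 'Ritu', 'Ajay', 'Pankaj'],
                       'English': [67, 65, 98, 45],
                       'IP': [98, 97, 95, 96]})

# loc: rows by label (end label included), columns by name
by_label = df_sel.loc[1:2, ['Name', 'IP']]

# iloc: rows and columns by integer position (end position excluded)
by_position = df_sel.iloc[1:3, [0, 2]]

# A boolean mask inside loc keeps only the rows where the condition holds
high_english = df_sel.loc[df_sel['English'] > 60, 'Name']
print(by_label, by_position, high_english, sep='\n\n')
```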


## Deleting a Column from DataFrame
This can be done by using
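1. del keyword : simply deletes the column (a Series) from the dataframe in place
2. pop() : deletes the column and also returns it as a Series
3. drop() : syntax - drop(labels, axis=1); returns a new dataframe without the column and leaves the original unchanged unless the result is reassigned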

### del keyword

In [24]:
stu = {'Name':['Rinku','Ritu','Ajay','Pankaj'],
       'English':[67,65,98,45],
       'Economics':[78,59,98,65],
       'IP':[98,97,95,96],
       'Accouncts':[77,70,80,57]}
df = pd.DataFrame(stu)
print('Before')
print(df)
del df['IP']
print('After using del df["IP"]')
print(df)

Before
     Name  English  Economics  IP  Accouncts
0   Rinku       67         78  98         77
1    Ritu       65         59  97         70
2    Ajay       98         98  95         80
3  Pankaj       45         65  96         57
After using del df["IP"]
     Name  English  Economics  Accouncts
0   Rinku       67         78         77
1    Ritu       65         59         70
2    Ajay       98         98         80
3  Pankaj       45         65         57


###  pop() method

In [29]:
stu = {'Name':['Rinku','Ritu','Ajay','Pankaj'],
       'English':[67,65,98,45],
       'Economics':[78,59,98,65],
       'IP':[98,97,95,96],
       'Accouncts':[77,70,80,57]}
df = pd.DataFrame(stu)
print('Before')
print(df)
print('\nUsing df.pop("IP")')
print(df.pop('IP'))
print('\nAfter using df.pop("IP")')
print(df)

Before
     Name  English  Economics  IP  Accouncts
0   Rinku       67         78  98         77
1    Ritu       65         59  97         70
2    Ajay       98         98  95         80
3  Pankaj       45         65  96         57

Using df.pop("IP")
0    98
1    97
2    95
3    96
Name: IP, dtype: int64

After using df.pop("IP")
     Name  English  Economics  Accouncts
0   Rinku       67         78         77
1    Ritu       65         59         70
2    Ajay       98         98         80
3  Pankaj       45         65         57


### drop( )

In [2]:
stu = {'Name':['Rinku','Ritu','Ajay','Pankaj'],
       'English':[67,65,98,45],
       'Economics':[78,59,98,65],
       'IP':[98,97,95,96],
       'Accouncts':[77,70,80,57]}
df = pd.DataFrame(stu)
print('Before')
print(df)
df = df.drop('IP', axis=1)  # drop() returns a new dataframe; reassign to keep the result
print('\nAfter using df = df.drop("IP",axis=1)')
print(df)

Before
     Name  English  Economics  IP  Accouncts
0   Rinku       67         78  98         77
1    Ritu       65         59  97         70
2    Ajay       98         98  95         80
3  Pankaj       45         65  96         57

After using df = df.drop("IP",axis=1)
     Name  English  Economics  Accouncts
0   Rinku       67         78         77
1    Ritu       65         59         70
2    Ajay       98         98         80
3  Pankaj       45         65         57
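
Unlike del and pop(), drop() does not change the dataframe it is called on. The following sketch, using a reduced made-up dataframe named df_drop, shows that a bare drop() call leaves the column in place, and that the result has to be assigned to take effect. It also shows the columns= keyword and row deletion with the default axis.

```python
import pandas as pd

df_drop = pd.DataFrame({'Name': ['Rinku', 'Ritu'],
                        'English': [67, 65],
                        'IP': [98, 97]})

# Without reassignment the original is untouched: drop() returns a copy
df_drop.drop('IP', axis=1)
still_has_ip = 'IP' in df_drop.columns

# columns= is an alternative spelling of axis=1; reassign to keep the result
df_drop = df_drop.drop(columns=['IP'])
print(df_drop)

# axis=0 (the default) drops rows by index label instead
df_drop = df_drop.drop(0)
print(df_drop)
```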


##  Iterations in DataFrame - Iterrows
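Iteration means accessing and retrieving the records of a dataframe one by one.
1. < DFObject >.iterrows() - iterates over the dataframe row-wise, yielding an (index, row Series) pair for each record.
2. < DFObject >.items() - iterates over the dataframe column-wise, yielding a (column name, column Series) pair. It was called iteritems() before pandas 2.0, which removed the old name.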

###   iterrows()

In [4]:
stu = {'Name':['Rinku','Ritu','Ajay','Pankaj'],
       'English':[67,65,98,45],
       'Economics':[78,59,98,65],
       'IP':[98,97,95,96],
       'Accouncts':[77,70,80,57]}
df = pd.DataFrame(stu)
print('Before')
print(df)

for (row,rowSeries) in df.iterrows():
    print("RowIndex :",row)
    print("Containing :")
    print(rowSeries)

Before
     Name  English  Economics  IP  Accouncts
0   Rinku       67         78  98         77
1    Ritu       65         59  97         70
2    Ajay       98         98  95         80
3  Pankaj       45         65  96         57
RowIndex : 0
Containing :
Name         Rinku
English         67
Economics       78
IP              98
Accouncts       77
Name: 0, dtype: object
RowIndex : 1
Containing :
Name         Ritu
English        65
Economics      59
IP             97
Accouncts      70
Name: 1, dtype: object
RowIndex : 2
Containing :
Name         Ajay
English        98
Economics      98
IP             95
Accouncts      80
Name: 2, dtype: object
RowIndex : 3
Containing :
Name         Pankaj
English          45
Economics        65
IP               96
Accouncts        57
Name: 3, dtype: object


### items( ) (formerly iteritems( ))

In [4]:
stu = {'Name':['Rinku','Ritu','Ajay','Pankaj'],
       'English':[67,65,98,45],
       'Economics':[78,59,98,65],
       'IP':[98,97,95,96],
       'Accouncts':[77,70,80,57]}
df = pd.DataFrame(stu)
print('Before')
print(df)

# displaying column-wise data; items() replaces iteritems(), which pandas 2.0 removed
for (col,colSeries) in df.items():
    print("ColIndex :",col)
    print("Containing :")
    print(colSeries)

Before
     Name  English  Economics  IP  Accouncts
0   Rinku       67         78  98         77
1    Ritu       65         59  97         70
2    Ajay       98         98  95         80
3  Pankaj       45         65  96         57
ColIndex : Name
Containing :
0     Rinku
1      Ritu
2      Ajay
3    Pankaj
Name: Name, dtype: object
ColIndex : English
Containing :
0    67
1    65
2    98
3    45
Name: English, dtype: int64
ColIndex : Economics
Containing :
0    78
1    59
2    98
3    65
Name: Economics, dtype: int64
ColIndex : IP
Containing :
0    98
1    97
2    95
3    96
Name: IP, dtype: int64
ColIndex : Accouncts
Containing :
0    77
1    70
2    80
3    57
Name: Accouncts, dtype: int64
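
Explicit iteration is convenient for inspecting records, but many per-row calculations can be written as a single column expression instead. The sketch below computes the same totals both ways on a reduced, made-up dataframe named df_it; the other names are illustrative as well.

```python
import pandas as pd

df_it = pd.DataFrame({'Name': ['Rinku', 'Ritu'],
                      'English': [67, 65],
                      'IP': [98, 97]})

# Row-by-row total using iterrows()
loop_totals = []
for label, row in df_it.iterrows():
    loop_totals.append(row['English'] + row['IP'])

# The same result with a single column-wise (vectorised) expression
vector_totals = (df_it['English'] + df_it['IP']).tolist()
print(loop_totals, vector_totals)
```

On large dataframes the column-wise form is usually much faster, since the loop runs inside pandas rather than in Python.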


## Binary Operations
Methods such as add(), sub(), mul() and div(), together with their reversed counterparts such as radd() and rsub(), carry out binary operations on dataframes.

In [8]:
# Binary operations on Dataframes
std1 = {"Unit test 1":[5,6,8,3,10],"Unit test 2":[7,8,9,6,15]}
std2 = {"Unit test 1":[3,3,6,5,8],"Unit test 2":[5,9,8,10,5]}
ds1 = pd.DataFrame(std1)
ds2 = pd.DataFrame(std2)
print("ds1 :")
print(ds1)
print("ds2 :")
print(ds2)
print("Subtraction : ds1.sub(ds2)")
print(ds1.sub(ds2))
print(" ds1.rsub(ds2)")
print(ds1.rsub(ds2))
print("Addition : ds1.add(ds2)")
print(ds1.add(ds2))
print(" ds1.radd(ds2)")
print(ds1.radd(ds2))
print("Multiplication : ds1.mul(ds2)")
print(ds1.mul(ds2))
print(" Division: ds1.div(ds2)")
print(ds1.div(ds2))

ds1 :
   Unit test 1  Unit test 2
0            5            7
1            6            8
2            8            9
3            3            6
4           10           15
ds2 :
   Unit test 1  Unit test 2
0            3            5
1            3            9
2            6            8
3            5           10
4            8            5
Subtraction : ds1.sub(ds2)
   Unit test 1  Unit test 2
0            2            2
1            3           -1
2            2            1
3           -2           -4
4            2           10
 ds1.rsub(ds2)
   Unit test 1  Unit test 2
0           -2           -2
1           -3            1
2           -2           -1
3            2            4
4           -2          -10
Addition : ds1.add(ds2)
   Unit test 1  Unit test 2
0            8           12
1            9           17
2           14           17
3            8           16
4           18           20
 ds1.radd(ds2)
   Unit test 1  Unit test 2
0            8           12
1          

## 2.13)  Missing Data and Filling Values
Values that are absent or carry no computational significance are called missing values. Pandas represents them as NaN, i.e. Not a Number. Missing values can be filled using the fillna() method.

In [18]:
# np.nan marks a missing value (the np.NaN alias was removed in NumPy 2.0)
stu = {'Name':['Rinku','Ritu','Ajay','Pankaj'],
       'English':[67,65,98,np.nan],
       'Economics':[78,np.nan,98,65],
       'IP':[98,97,95,96],
       'Accouncts':[np.nan,70,80,57]}
df = pd.DataFrame(stu)
print('Before')
print(df)

print('After using fillna(0)')
print(df.fillna(0))

# Replacing NaN with a different constant for each column (keys are column names)
print("df.fillna({'English':14, 'IP':5, 'Economics':8})")
print(df.fillna({'English':14, 'IP':5, 'Economics':8}))

Before
     Name  English  Economics  IP  Accouncts
0   Rinku     67.0       78.0  98        NaN
1    Ritu     65.0        NaN  97       70.0
2    Ajay     98.0       98.0  95       80.0
3  Pankaj      NaN       65.0  96       57.0
After using fillna(0)
     Name  English  Economics  IP  Accouncts
0   Rinku     67.0       78.0  98        0.0
1    Ritu     65.0        0.0  97       70.0
2    Ajay     98.0       98.0  95       80.0
3  Pankaj      0.0       65.0  96       57.0
df.fillna({'English':14, 'IP':5, 'Economics':8})
     Name  English  Economics  IP  Accouncts
0   Rinku     67.0       78.0  98        NaN
1    Ritu     65.0        8.0  97       70.0
2    Ajay     98.0       98.0  95       80.0
3  Pankaj     14.0       65.0  96       57.0
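
Besides filling with constants, missing values are often counted, dropped, or filled with a statistic. The following is a minimal sketch on a reduced, made-up version of the marks table (the name `marks` is an assumption for this example):

```python
import numpy as np
import pandas as pd

marks = pd.DataFrame({"English": [67, 65, 98, np.nan],
                      "Economics": [78, np.nan, 98, 65]})

missing_per_column = marks.isna().sum()        # number of NaN values in each column
filled_with_mean = marks.fillna(marks.mean())  # each NaN replaced by its column mean
complete_rows = marks.dropna()                 # keep only rows without any NaN

print(missing_per_column)
print(filled_with_mean)
print(complete_rows)
```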


## 2.14) Combining DataFrame
Pandas can combine or concatenate two or more DataFrames row-wise or column-wise using the concat() function.

### 2.14.1) Concatenation in DataFrame
syn: pd.concat(objs, axis=0, join='outer', ignore_index=False)
* objs - A sequence or mapping of Series or DataFrame objects.
* axis - {0, 1}, default 0. 0 concatenates along rows, 1 along columns.
* join - {'inner','outer'}, default 'outer'. How to handle the labels on the other axis: 'outer' keeps their union, 'inner' their intersection.
* ignore_index - boolean, default False. If True, the index values on the concatenation axis are not used and the resulting axis is labelled 0,...,n-1.
* Note - the join_axes parameter of older versions was removed in pandas 1.0; reindex the result instead.

In [2]:
d1 = {"rollno":[10,11,12,13,14,15],"Name":['Rinku','Ankit','Ram','Joe','Hari','Teja']}
df1 = pd.DataFrame(d1)
d2 = {"rollno":[1,2,3,4,5,6],"Name":['Raj','shiva','Pavan','Pranay','Yash','kumar']}
df2 = pd.DataFrame(d2)

print('df1 :')
print(df1)
print('df2 :')
print(df2)
# axis = 0 concat along rows
print('pd.concat([df1,df2],axis=0) :')
print(pd.concat([df1,df2],axis=0))
# axis = 1 concat along columns
print('pd.concat([df1,df2],axis=1) :')
print(pd.concat([df1,df2],axis=1))

df1 :
   rollno   Name
0      10  Rinku
1      11  Ankit
2      12    Ram
3      13    Joe
4      14   Hari
5      15   Teja
df2 :
   rollno    Name
0       1     Raj
1       2   shiva
2       3   Pavan
3       4  Pranay
4       5    Yash
5       6   kumar
pd.concat([df1,df2],axis=0) :
   rollno    Name
0      10   Rinku
1      11   Ankit
2      12     Ram
3      13     Joe
4      14    Hari
5      15    Teja
0       1     Raj
1       2   shiva
2       3   Pavan
3       4  Pranay
4       5    Yash
5       6   kumar
pd.concat([df1,df2],axis=1) :
   rollno   Name  rollno    Name
0      10  Rinku       1     Raj
1      11  Ankit       2   shiva
2      12    Ram       3   Pavan
3      13    Joe       4  Pranay
4      14   Hari       5    Yash
5      15   Teja       6   kumar
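
The row-wise result above repeats the index labels 0 to 5. As a hedged sketch on two small made-up frames (the extra `City` column is an assumption added to show the effect of `join`), `ignore_index` and `join` can be used like this:

```python
import pandas as pd

df1 = pd.DataFrame({"rollno": [10, 11], "Name": ["Rinku", "Ankit"]})
df2 = pd.DataFrame({"rollno": [1, 2], "Name": ["Raj", "shiva"],
                    "City": ["Goa", "Delhi"]})   # has one extra column

# ignore_index=True relabels the rows 0..n-1 instead of repeating 0, 1
renumbered = pd.concat([df1, df2], ignore_index=True)

# join='inner' keeps only the columns present in every input
common_only = pd.concat([df1, df2], join="inner", ignore_index=True)

print(renumbered)
print(common_only)
```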


### 2.14.2) Merge operation
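Pandas provides a single function, merge(), as the entry point for all standard database join operations between DataFrame objects. By default it performs an inner join, so only the keys present in both DataFrames are kept, and overlapping column names receive the suffixes _x and _y.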

In [4]:
d1 = {'roll_no':[10,11,12,13,14,15],'name':['Ankit','Pihu','Rinku','Yash','Vujay','Nikhil']}
d2 = {'roll_no':[20,21,22,23,24,25],'name':['Shaurya','Pinky','Anubhav','Khushi','Vinay','Neetu']}
d3 = {'roll_no':[10,21,25,23,20,13],'name':['Jeet','Ashima','Shivin','Kiran','Tanmay','Rajat']}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
df12 = pd.concat([df1,df2])
df3 = pd.DataFrame(d3)
df34 = pd.merge(df12,df3, on='roll_no')
print("df1: \n",df1)
print("df2: \n",df2)
print("df12: \n",df12)
print("df3: \n",df3)
print("df34: \n",df34)

df1: 
    roll_no    name
0       10   Ankit
1       11    Pihu
2       12   Rinku
3       13    Yash
4       14   Vujay
5       15  Nikhil
df2: 
    roll_no     name
0       20  Shaurya
1       21    Pinky
2       22  Anubhav
3       23   Khushi
4       24    Vinay
5       25    Neetu
df12: 
    roll_no     name
0       10    Ankit
1       11     Pihu
2       12    Rinku
3       13     Yash
4       14    Vujay
5       15   Nikhil
0       20  Shaurya
1       21    Pinky
2       22  Anubhav
3       23   Khushi
4       24    Vinay
5       25    Neetu
df3: 
    roll_no    name
0       10    Jeet
1       21  Ashima
2       25  Shivin
3       23   Kiran
4       20  Tanmay
5       13   Rajat
df34: 
    roll_no   name_x  name_y
0       10    Ankit    Jeet
1       13     Yash   Rajat
2       20  Shaurya  Tanmay
3       21    Pinky  Ashima
4       23   Khushi   Kiran
5       25    Neetu  Shivin
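
Other join types are selected with the `how` parameter. A minimal sketch, assuming two small made-up frames `left` and `right` and the illustrative suffixes `_old` and `_new`:

```python
import pandas as pd

left = pd.DataFrame({"roll_no": [10, 11, 12], "name": ["Ankit", "Pihu", "Rinku"]})
right = pd.DataFrame({"roll_no": [10, 12, 30], "name": ["Jeet", "Kiran", "Rajat"]})

# default inner join: only roll numbers found in both frames
inner = pd.merge(left, right, on="roll_no")

# outer join: every roll number from either frame; indicator=True adds a
# _merge column saying where each row came from
outer = pd.merge(left, right, on="roll_no", how="outer",
                 suffixes=("_old", "_new"), indicator=True)

print(inner)
print(outer)
```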


## 2.15) Converting DataFrame into array

In [11]:
# np.nan marks a missing value (the np.NaN alias was removed in NumPy 2.0)
stu = {'Name':['Rinku','Ritu','Ajay','Pankaj'],
       'English':[67,65,98,np.nan],
       'Economics':[78,np.nan,98,65],
       'IP':[98,97,95,96],
       'Accouncts':[np.nan,70,80,57]}
df = pd.DataFrame(stu)
# convert the numeric columns (everything except Name) into a NumPy array
print(df.iloc[:,1:].values)
print('\n',df.iloc[:,1:].values.shape)

[[67. 78. 98. nan]
 [65. nan 97. 70.]
 [98. 98. 95. 80.]
 [nan 65. 96. 57.]]

 (4, 4)
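
The same conversion can also be written with the to_numpy() method, which the pandas documentation recommends over the .values attribute. A short sketch on a reduced, made-up version of the table:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["Rinku", "Ritu"],
                   "English": [67, np.nan],
                   "IP": [98, 97]})

# select the numeric columns by name, then convert to a NumPy array
arr = df[["English", "IP"]].to_numpy()

print(arr)
print(arr.shape, arr.dtype)
```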


# 3) CSV files and DataFrame
CSV (comma-separated values) is a simple file format used to store tabular data, such as a spreadsheet or a database table. A CSV file stores tabular data (numbers and text) in plain text.

In CSV format:
* Each row of the table is stored on one line.
* The field values of a row are stored together, separated by commas.

## 3.1) Advantages of CSV format:
* A simple, compact and ubiquitous format for data storage.
* A common format for data interchange.
* It can be opened in popular spreadsheet packages such as MS Excel, OpenOffice Calc, etc.
* Nearly all spreadsheets and databases support import/export to CSV format.

## 3.2) Reading from CSV file to DataFrame
The read_csv() function in pandas loads the data from a CSV file into a pandas DataFrame.
* Note: a CSV file stores every value as text. Pandas infers a specific data type for each column when loading the data.

Syntax: <df> = pd.read_csv(<filePath>)

In [17]:
# select the proper path of your file
df = pd.read_csv("./Employee.csv")
print(df)
print("\ndf.shape",df.shape)

   Empid     Name   Age       City   Salary
0  100.0   Ritesh  25.0     Mumbai  15000.0
1  101.0   Aakash  26.0        Goa  16000.0
2    NaN      NaN   NaN        NaN      NaN
3  102.0   Mahima  27.0  Hyderabad  20000.0
4  103.0  Lakshay  25.0      Delhi  18000.0
5  104.0     Manu  23.0     Mumbai  25000.0
6  105.0    Nidhi  26.0      Delhi      NaN
7  106.0    Geetu  30.0  Bangalore  28000.0

df.shape (8, 5)
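
To see the type inference mentioned in the note without needing a file on disk, here is a small sketch that reads made-up CSV text from memory (the three sample rows are an assumption, not the contents of Employee.csv):

```python
from io import StringIO
import pandas as pd

csv_text = "Empid,Name,Salary\n100,Ritesh,15000.5\n101,Aakash,16000\n"

df = pd.read_csv(StringIO(csv_text))

# every field was plain text in the source, but each column gets its own dtype
print(df.dtypes)
```

Whole numbers become an integer column and a column containing a decimal value becomes a float column, while the names stay textual.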


In [18]:
# summary statistics of the numeric columns
df.describe()

          Empid       Age        Salary
count       7.0       7.0           6.0
mean      103.0      26.0  20333.333333
std    2.160247  2.160247   5163.977795
min       100.0      23.0       15000.0
25%       101.5      25.0       16500.0
50%       103.0      26.0       19000.0
75%       104.5      26.5       23750.0
max       106.0      30.0       28000.0


### 3.2.1) Reading CSV file with specific/selected columns
Syn: df = pd.read_csv("FilePath" , usecols=[col names])

In [4]:
df = pd.read_csv("./Employee.csv", usecols=['Age','Empid'])
print(df)

   Empid   Age
0  100.0  25.0
1  101.0  26.0
2    NaN   NaN
3  102.0  27.0
4  103.0  25.0
5  104.0  23.0
6  105.0  26.0
7  106.0  30.0


### 3.2.2) Reading CSV file with specific/selected rows
The nrows option reads only the first <num> data rows of the file.

Syn: df = pd.read_csv('FilePath', nrows=<num>)

In [6]:
df = pd.read_csv("./Employee.csv", nrows=3)
print(df)

   Empid    Name   Age    City   Salary
0  100.0  Ritesh  25.0  Mumbai  15000.0
1  101.0  Aakash  26.0     Goa  16000.0
2    NaN     NaN   NaN     NaN      NaN


### 3.2.3) Reading CSV file without header
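The option "header=None" tells read_csv() that the file has no header row: the columns are labelled 0, 1, 2, ... and the first line of the file is read as ordinary data. If the file does have a header line that should be dropped, combine "header=None" with "skiprows=1".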

In [7]:
df = pd.read_csv("./Employee.csv", header = None)
print(df)

       0        1    2          3       4
0  Empid     Name  Age       City  Salary
1    100   Ritesh   25     Mumbai   15000
2    101   Aakash   26        Goa   16000
3    NaN      NaN  NaN        NaN     NaN
4    102   Mahima   27  Hyderabad   20000
5    103  Lakshay   25      Delhi   18000
6    104     Manu   23     Mumbai   25000
7    105    Nidhi   26      Delhi     NaN
8    106    Geetu   30  Bangalore   28000
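
In the output above the original header line appears as data row 0. A minimal sketch of the difference between header=None alone and header=None with skiprows=1, using made-up CSV text in memory:

```python
from io import StringIO
import pandas as pd

csv_text = "Empid,Name\n100,Ritesh\n101,Aakash\n"

# header=None alone: the header line is kept as the first data row
as_data = pd.read_csv(StringIO(csv_text), header=None)

# header=None with skiprows=1: the header line is dropped entirely
no_header = pd.read_csv(StringIO(csv_text), header=None, skiprows=1)

print(as_data)
print(no_header)
```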


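### 3.2.4) Reading CSV file without the default index
By default read_csv() adds an integer row index (0, 1, 2, ...). Specifying the option "index_col=0" in the read_csv() function uses the first column of the file as the row index instead.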

In [55]:
# making 1st col as row-index
df = pd.read_csv("./Employee.csv", index_col=0)
print(df)

          Name   Age       City   Salary
Empid                                   
100.0   Ritesh  25.0     Mumbai  15000.0
101.0   Aakash  26.0        Goa  16000.0
NaN        NaN   NaN        NaN      NaN
102.0   Mahima  27.0  Hyderabad  20000.0
103.0  Lakshay  25.0      Delhi  18000.0
104.0     Manu  23.0     Mumbai  25000.0
105.0    Nidhi  26.0      Delhi      NaN
106.0    Geetu  30.0  Bangalore  28000.0


### 3.2.5) Reading CSV file with new column names
If a header exists, skip it using "skiprows=1" and pass the "names" option to give the columns new names.

In [9]:
df = pd.read_csv("./Employee.csv", skiprows=1, 
                 names=['E-id','E_Name','E_age','E_City','E_Salary'])
print(df)

    E-id   E_Name  E_age     E_City  E_Salary
0  100.0   Ritesh   25.0     Mumbai   15000.0
1  101.0   Aakash   26.0        Goa   16000.0
2    NaN      NaN    NaN        NaN       NaN
3  102.0   Mahima   27.0  Hyderabad   20000.0
4  103.0  Lakshay   25.0      Delhi   18000.0
5  104.0     Manu   23.0     Mumbai   25000.0
6  105.0    Nidhi   26.0      Delhi       NaN
7  106.0    Geetu   30.0  Bangalore   28000.0


### 3.2.6) Get the unique category counts

In [24]:
df = pd.read_csv("./Employee.csv")
print(df)
print("\n df['Age'].value_counts() : \n",df['Age'].value_counts())

   Empid     Name   Age       City   Salary
0  100.0   Ritesh  25.0     Mumbai  15000.0
1  101.0   Aakash  26.0        Goa  16000.0
2    NaN      NaN   NaN        NaN      NaN
3  102.0   Mahima  27.0  Hyderabad  20000.0
4  103.0  Lakshay  25.0      Delhi  18000.0
5  104.0     Manu  23.0     Mumbai  25000.0
6  105.0    Nidhi  26.0      Delhi      NaN
7  106.0    Geetu  30.0  Bangalore  28000.0

 df['Age'].value_counts() : 
 25.0    2
26.0    2
27.0    1
23.0    1
30.0    1
Name: Age, dtype: int64
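
Note that value_counts() leaves out NaN by default, which is why the counts above add up to 7 rather than 8. A small sketch on a made-up Series showing the `dropna` and `normalize` options:

```python
import pandas as pd

ages = pd.Series([25, 26, None, 27, 25])   # None is stored as NaN

counts = ages.value_counts()                 # NaN is not counted
with_nan = ages.value_counts(dropna=False)   # NaN gets its own entry
shares = ages.value_counts(normalize=True)   # fractions of the non-missing values

print(counts)
print(with_nan)
print(shares)
```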


## 3.3) Updating/Modifying contents in a CSV file
The "na_values" option of read_csv() marks additional values as NaN while loading: "na_values=<the values we want to treat as missing>". This changes only the DataFrame in memory; the CSV file itself is not modified.

In [10]:
df = pd.read_csv("./Employee.csv", na_values=[16000])
print(df)

   Empid     Name   Age       City   Salary
0  100.0   Ritesh  25.0     Mumbai  15000.0
1  101.0   Aakash  26.0        Goa      NaN
2    NaN      NaN   NaN        NaN      NaN
3  102.0   Mahima  27.0  Hyderabad  20000.0
4  103.0  Lakshay  25.0      Delhi  18000.0
5  104.0     Manu  23.0     Mumbai  25000.0
6  105.0    Nidhi  26.0      Delhi      NaN
7  106.0    Geetu  30.0  Bangalore  28000.0
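
na_values also accepts a dictionary, so that a marker is treated as missing only in a given column. A hedged sketch on made-up CSV text (the markers "unknown" and -1 are assumptions chosen for illustration):

```python
from io import StringIO
import pandas as pd

csv_text = "Name,City,Salary\nRitesh,Mumbai,15000\nAakash,unknown,-1\n"

# per-column missing-value markers: 'unknown' only in City, -1 only in Salary
df = pd.read_csv(StringIO(csv_text),
                 na_values={"City": ["unknown"], "Salary": [-1]})

print(df)
```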


## 3.4) Writing a CSV file with default index
To create a CSV file from DataFrame, to_csv() method is used.

### 3.4.1) Copying Employee.csv to EmpNew.csv
By using to_csv()

In [4]:
# reading csv file
df = pd.read_csv("./Employee.csv") 
# copying df to new csv file
df.to_csv("./EmpNew.csv")
print(df)

   Empid     Name   Age       City   Salary
0  100.0   Ritesh  25.0     Mumbai  15000.0
1  101.0   Aakash  26.0        Goa  16000.0
2    NaN      NaN   NaN        NaN      NaN
3  102.0   Mahima  27.0  Hyderabad  20000.0
4  103.0  Lakshay  25.0      Delhi  18000.0
5  104.0     Manu  23.0     Mumbai  25000.0
6  105.0    Nidhi  26.0      Delhi      NaN
7  106.0    Geetu  30.0  Bangalore  28000.0


### 3.4.2) Saving DataFrame as csv file
This can be done by creating a DataFrame and saving it to a location on the system.

In [5]:
# np.nan marks a missing value (the np.NaN alias was removed in NumPy 2.0)
stu = {'Name':['Rinku','Ritu','Ajay','Pankaj'],
       'English':[67,65,98,np.nan],
       'Economics':[78,np.nan,98,65],
       'IP':[98,97,95,96],
       'Accouncts':[np.nan,70,80,57]}
df = pd.DataFrame(stu) # creating dataframe
print(df)
df.to_csv("./Student.csv")

     Name  English  Economics  IP  Accouncts
0   Rinku     67.0       78.0  98        NaN
1    Ritu     65.0        NaN  97       70.0
2    Ajay     98.0       98.0  95       80.0
3  Pankaj      NaN       65.0  96       57.0


In [12]:
df = pd.read_csv("./Student.csv")
print("Before: \n",df)
# to_csv() also wrote the row index, which is read back as "Unnamed: 0"; use index_col=0 to avoid this
df = pd.read_csv("./Student.csv", index_col=0)
print("After: \n",df)

Before: 
    Unnamed: 0    Name  IP
0           0   Rinku  98
1           1    Ritu  97
2           2    Ajay  95
3           3  Pankaj  96
After: 
      Name  IP
0   Rinku  98
1    Ritu  97
2    Ajay  95
3  Pankaj  96
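
An alternative to index_col=0 is to leave the index out when writing. A minimal sketch, assuming a small made-up DataFrame; to_csv() called without a path returns the CSV text, so no file is needed here:

```python
from io import StringIO
import pandas as pd

df = pd.DataFrame({"Name": ["Rinku", "Ritu"], "IP": [98, 97]})

# index=False leaves the row index out of the written CSV
csv_text = df.to_csv(index=False)
print(csv_text)

# reading it back yields the same columns, with no "Unnamed: 0" column
round_trip = pd.read_csv(StringIO(csv_text))
print(round_trip)
```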


### 3.4.3) Copying fields into a new file

In [14]:
# np.nan marks a missing value (the np.NaN alias was removed in NumPy 2.0)
stu = {'Name':['Rinku','Ritu','Ajay','Pankaj'],
       'English':[67,65,98,np.nan],
       'Economics':[78,np.nan,98,65],
       'IP':[98,97,95,96],
       'Accouncts':[np.nan,70,80,57]}
df = pd.DataFrame(stu) # creating dataframe
print(df)
# write only the selected columns to the file
df.to_csv("./Student.csv", columns=["Name","IP"])

     Name  English  Economics  IP  Accouncts
0   Rinku     67.0       78.0  98        NaN
1    Ritu     65.0        NaN  97       70.0
2    Ajay     98.0       98.0  95       80.0
3  Pankaj      NaN       65.0  96       57.0


## 3.5) Reading CSV data from a string

In [25]:
from io import StringIO, BytesIO

In [29]:
data = ('col1,col2,col3\n'
       'x,y,1\n'
       'a,b,2\n'
       'c,d,3')
type(data)

str

In [30]:
pd.read_csv(StringIO(data))

  col1 col2  col3
0    x    y     1
1    a    b     2
2    c    d     3


### 3.5.1) Read from specific columns

In [37]:
data = ('col1,col2,col3\n'
       'x,y,1\n'
       'a,b,2\n'
       'c,d,3')
df = pd.read_csv(StringIO(data), usecols=['col1','col3'])
# df = pd.read_csv(StringIO(data), usecols=lambda x: x.upper() in ['COL1','COL3'])
df

  col1  col3
0    x     1
1    a     2
2    c     3


### 3.5.2) Specifying the datatype while reading CSV data

data = ('a,b,c,d\n'
       '1,2,3,4\n'
       '5,6,7,8\n'
       '9,10,11')
df = pd.read_csv(StringIO(data), dtype=object)
print(df)
print("df['a'][1]) :", df['a'][1])
print(type(df['a'][1]))

In [6]:
data = ('a,b,c,d\n'
       '1,2,3,4\n'
       '5,6,7,8\n'
       '9,10,11')
df = pd.read_csv(StringIO(data), dtype={'a':object, 'b':int, 'c':float})
print(df)
print("df['b'][1]) :", df['b'][1])
print(type(df['b'][1]))

   a   b     c    d
0  1   2   3.0  4.0
1  5   6   7.0  8.0
2  9  10  11.0  NaN
df['b'][1]) : 6
<class 'numpy.int64'>


In [7]:
# Quoting and Escape Characters, very useful in NLP
data = 'a,b\n"hello, \\"Bob\\",nice to see you",5'
df = pd.read_csv(StringIO(data),escapechar='\\')
df

                              a  b
0  hello, "Bob",nice to see you  5
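
One common reason to set a dtype explicitly is to stop pandas from turning codes into numbers. A small sketch on made-up data (the column names `code` and `qty` are assumptions for this example):

```python
from io import StringIO
import pandas as pd

data = "code,qty\n007,5\n042,8\n"

inferred = pd.read_csv(StringIO(data))                      # code is parsed as integers
as_text = pd.read_csv(StringIO(data), dtype={"code": str})  # code is kept as text

print(inferred["code"].tolist())
print(as_text["code"].tolist())
```

With inference the leading zeros are lost, while dtype=str preserves the values exactly as written.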


### 3.5.3) URL to CSV

In [12]:
df = pd.read_csv('https://download.bls.gov/pub/time.series/cu/cu.item', sep='\t')
df.head()

   item_code                                          item_name  display_level  selectable  sort_sequence
0        AA0                               All items - old base              0           T              2
1       AA0R  Purchasing power of the consumer dollar - old ...              0           T            400
2        SA0                                          All items              0           T              1
3       SA0E                                             Energy              1           T            375
4      SA0L1                                All items less food              1           T            359


# 4) JSON with pandas

## 4.1) Read JSON into a DataFrame

In [10]:
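data = '{"employee_name":"James","email":"james@gmail.com","job_profile":[{"title1":"Team Lead","title2":"Sr. Developer"}]}'
# wrap the literal JSON string in StringIO; passing it directly is deprecated since pandas 2.1
pd.read_json(StringIO(data))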

   employee_name            email                                        job_profile
0          James  james@gmail.com  {'title1': 'Team Lead', 'title2': 'Sr. Develop...
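
In the result above the nested job_profile entry stays a dictionary inside a single cell. One possible way to spread it into columns is pd.json_normalize(); the sketch below assumes the same record and uses the illustrative variable names `record` and `flat`:

```python
import json
import pandas as pd

data = ('{"employee_name":"James","email":"james@gmail.com",'
        '"job_profile":[{"title1":"Team Lead","title2":"Sr. Developer"}]}')
record = json.loads(data)   # plain Python dict

# one row per job_profile entry; the top-level fields are repeated as meta columns
flat = pd.json_normalize(record, record_path="job_profile",
                         meta=["employee_name", "email"])

print(flat)
```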


In [16]:
# loading a headerless CSV dataset from a URL; this DataFrame is converted to JSON in section 4.2
# note: read_json() reads data cleanly only when it is organised as key:value pairs
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data',header=None)
df.head()

   0      1     2     3     4    5     6     7     8     9    10    11    12    13
0  1  14.23  1.71  2.43  15.6  127   2.8  3.06  0.28  2.29  5.64  1.04  3.92  1065
1  1   13.2  1.78  2.14  11.2  100  2.65  2.76  0.26  1.28  4.38  1.05   3.4  1050
2  1  13.16  2.36  2.67  18.6  101   2.8  3.24   0.3  2.81  5.68  1.03  3.17  1185
3  1  14.37  1.95   2.5  16.8  113  3.85  3.49  0.24  2.18   7.8  0.86  3.45  1480
4  1  13.24  2.59  2.87  21.0  118   2.8  2.69  0.39  1.82  4.32  1.04  2.93   735


## 4.2) Convert a DataFrame to CSV and JSON formats

In [17]:
# saving the DataFrame into wine.csv
df.to_csv('wine.csv')

In [19]:
# convert the DataFrame to JSON; orient selects the JSON layout
df.to_json(orient='index')

'{"0":{"0":1,"1":14.23,"2":1.71,"3":2.43,"4":15.6,"5":127,"6":2.8,"7":3.06,"8":0.28,"9":2.29,"10":5.64,"11":1.04,"12":3.92,"13":1065},"1":{"0":1,"1":13.2,"2":1.78,"3":2.14,"4":11.2,"5":100,"6":2.65,"7":2.76,"8":0.26,"9":1.28,"10":4.38,"11":1.05,"12":3.4,"13":1050},"2":{"0":1,"1":13.16,"2":2.36,"3":2.67,"4":18.6,"5":101,"6":2.8,"7":3.24,"8":0.3,"9":2.81,"10":5.68,"11":1.03,"12":3.17,"13":1185},"3":{"0":1,"1":14.37,"2":1.95,"3":2.5,"4":16.8,"5":113,"6":3.85,"7":3.49,"8":0.24,"9":2.18,"10":7.8,"11":0.86,"12":3.45,"13":1480},"4":{"0":1,"1":13.24,"2":2.59,"3":2.87,"4":21.0,"5":118,"6":2.8,"7":2.69,"8":0.39,"9":1.82,"10":4.32,"11":1.04,"12":2.93,"13":735},"5":{"0":1,"1":14.2,"2":1.76,"3":2.45,"4":15.2,"5":112,"6":3.27,"7":3.39,"8":0.34,"9":1.97,"10":6.75,"11":1.05,"12":2.85,"13":1450},"6":{"0":1,"1":14.39,"2":1.87,"3":2.45,"4":14.6,"5":96,"6":2.5,"7":2.52,"8":0.3,"9":1.98,"10":5.25,"11":1.02,"12":3.58,"13":1290},"7":{"0":1,"1":14.06,"2":2.15,"3":2.61,"4":17.6,"5":121,"6":2.6,"7":2.51,"8":0.3
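
The orient parameter accepts several layouts besides 'index'. A minimal sketch on a tiny made-up DataFrame, so that the resulting strings are short enough to compare:

```python
import json
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3.5, 4.5]})

by_index = df.to_json(orient="index")      # {row label: {column: value}}
by_columns = df.to_json(orient="columns")  # {column: {row label: value}}
by_records = df.to_json(orient="records")  # [{column: value}, ...], row labels dropped

print(by_index)
print(by_columns)
print(by_records)

# each result is ordinary JSON text and can be parsed back with json.loads()
parsed_records = json.loads(by_records)
```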

# 5) HTML with pandas

## 5.1) Reading HTML content

In [26]:
url = 'https://www.fdic.gov/bank/individual/failed/banklist.html'

# read_html() needs an HTML parser: install lxml, or html5lib together with beautifulsoup4
dfs = pd.read_html(url)

ImportError: html5lib not found, please install it

In [27]:
dfs[0]

NameError: name 'dfs' is not defined
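
The two errors above come from a missing parser library, not from the URL. As a hedged sketch, read_html() can be tried on a small made-up HTML snippet held in memory (the table contents are invented for this example), catching the ImportError when no parser is installed:

```python
from io import StringIO
import pandas as pd

html = """
<table>
  <tr><th>Bank</th><th>City</th></tr>
  <tr><td>First Bank</td><td>Goa</td></tr>
  <tr><td>Second Bank</td><td>Delhi</td></tr>
</table>
"""

try:
    # read_html() returns a list of DataFrames, one per <table> found
    tables = pd.read_html(StringIO(html))
    print(tables[0])
except ImportError as err:
    # raised when neither lxml nor html5lib with beautifulsoup4 is available
    tables = []
    print("No HTML parser available:", err)
```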

In [32]:
url_mcc = 'https://en.wikipedia.org/wiki/Mobile_country_code'
dfs = pd.read_html(url_mcc, match='Mobile network codes',header=0)
dfs[0]

Unnamed: 0,Mobile country code,Country,ISO 3166,Mobile network codes,National MNC authority,Remarks
0,289,A Abkhazia,GE-AB,List of mobile network codes in Abkhazia,,MCC is not listed by ITU
1,412,Afghanistan,AF,List of mobile network codes in Afghanistan,,
2,276,Albania,AL,List of mobile network codes in Albania,,
3,603,Algeria,DZ,List of mobile network codes in Algeria,,
4,544,American Samoa (United States of America),AS,List of mobile network codes in American Samoa,,
...,...,...,...,...,...,...
247,452,Vietnam,VN,List of mobile network codes in the Vietnam,,
248,543,W Wallis and Futuna,WF,List of mobile network codes in Wallis and Futuna,,
249,421,Y Yemen,YE,List of mobile network codes in the Yemen,,
250,645,Z Zambia,ZM,List of mobile network codes in Zambia,,
