### **_Pandas_**
#### Pandas is an open-source library that provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language. It offers many functions for reading, writing, and manipulating data, so we will use pandas throughout the course.


In [1]:
#Import Pandas
import pandas as pd

#Load data with the read_csv() function. Here we pass the URL of the CSV file.
#If the file is on your system, you can provide its local path instead.
iris = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data")

#Let's print and see its type
print(type(iris))


<class 'pandas.core.frame.DataFrame'>


### Pandas Dataframes
#### A DataFrame is an object for data manipulation. You can think of it as a 2D tabular structure, where every row is a dataset entry and every column represents a feature of the data.


### Creating copy of DataFrame
    df = iris 
#### The statement above simply makes df refer to the same DataFrame object that iris refers to. Both names now point to one object, so any change made through one is visible through the other.
#### In other words, this does not create another DataFrame object. To create an independent copy, use the copy() function:
    df = iris.copy()
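
The difference between plain assignment and `copy()` can be checked on a small frame. This is a minimal sketch: `iris_small`, `alias`, and `clone` are hypothetical names used only for illustration, and the two rows are a stand-in for the full dataset.

```python
import pandas as pd

# A tiny stand-in for the iris data (illustration only)
iris_small = pd.DataFrame({"sl": [5.1, 4.9], "sw": [3.5, 3.0]})

alias = iris_small          # a second name for the same object
clone = iris_small.copy()   # an independent copy

# Modify the original through its first name
iris_small.loc[0, "sl"] = 99.0

print(alias is iris_small)   # True: both names refer to one object
print(alias.loc[0, "sl"])    # 99.0: the change is visible through the alias
print(clone.loc[0, "sl"])    # 5.1: the copy is unaffected
```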

In [2]:
#Ignoring the header -> If you don't want the first row to be treated as a header, set header=None
iris = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", header=None)
iris

       0    1    2    3               4
0    5.1  3.5  1.4  0.2     Iris-setosa
1    4.9  3.0  1.4  0.2     Iris-setosa
2    4.7  3.2  1.3  0.2     Iris-setosa
3    4.6  3.1  1.5  0.2     Iris-setosa
4    5.0  3.6  1.4  0.2     Iris-setosa
..   ...  ...  ...  ...             ...
145  6.7  3.0  5.2  2.3  Iris-virginica
146  6.3  2.5  5.0  1.9  Iris-virginica
147  6.5  3.0  5.2  2.0  Iris-virginica
148  6.2  3.4  5.4  2.3  Iris-virginica
149  5.9  3.0  5.1  1.8  Iris-virginica
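
To see what `header=None` changes without downloading anything, the sketch below reads the first three lines of the file from an in-memory string. The names `raw`, `with_header`, and `no_header` are hypothetical and exist only for this illustration.

```python
import io
import pandas as pd

# First three lines of iris.data, inlined so no download is needed
raw = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)

# Default behaviour: the first line is consumed as the column headers
with_header = pd.read_csv(io.StringIO(raw))
# header=None: the first line stays in the data and columns are numbered
no_header = pd.read_csv(io.StringIO(raw), header=None)

print(with_header.shape)            # (2, 5)
print(no_header.shape)              # (3, 5)
print(no_header.columns.tolist())   # [0, 1, 2, 3, 4]
```

With the default setting one data row is silently lost to the header, which is why the full file yields 150 rows only when `header=None` is passed.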


In [3]:
df = iris.copy()
print(df) 
#Iris data has a total of 150 rows. Because we passed header=None, the first row is kept as data instead of being treated as a header

       0    1    2    3               4
0    5.1  3.5  1.4  0.2     Iris-setosa
1    4.9  3.0  1.4  0.2     Iris-setosa
2    4.7  3.2  1.3  0.2     Iris-setosa
3    4.6  3.1  1.5  0.2     Iris-setosa
4    5.0  3.6  1.4  0.2     Iris-setosa
..   ...  ...  ...  ...             ...
145  6.7  3.0  5.2  2.3  Iris-virginica
146  6.3  2.5  5.0  1.9  Iris-virginica
147  6.5  3.0  5.2  2.0  Iris-virginica
148  6.2  3.4  5.4  2.3  Iris-virginica
149  5.9  3.0  5.1  1.8  Iris-virginica

[150 rows x 5 columns]


In [4]:
# head() is a quick way to look at the first few entries of the data frame. By default it shows the first 5 entries.
#You can write df.head(n) if you wish to look at the first n entries.
#df.head()
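
As a quick check of the default and of the `n` argument, here is a small sketch on a hypothetical ten-row frame called `demo` (not part of the iris data). `tail()` is shown as the counterpart for the last rows.

```python
import pandas as pd

# Hypothetical frame with ten rows, used only to count what head() returns
demo = pd.DataFrame({"x": range(10)})

print(len(demo.head()))    # 5: the default number of rows
print(len(demo.head(3)))   # 3: the first n rows
print(len(demo.tail(2)))   # 2: tail() returns rows from the end instead
```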

In [5]:
#Column Headers
df.columns  # Will tell you what the current column headers are
#Changing Column Headers
df.columns = ['sl','sw','pl','pw','flower_type']
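
Assigning to `df.columns` is one way to set headers. Two alternatives are sketched below, assuming the same five column names: passing `names=` to `read_csv()` at load time, and `rename()` for changing only selected headers. The inline `raw` string and the new header `species` are hypothetical, chosen only for the example.

```python
import io
import pandas as pd

# Two lines of iris.data, inlined so the example runs offline
raw = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"
cols = ['sl', 'sw', 'pl', 'pw', 'flower_type']

# Set the headers while reading; header=None keeps the first line as data
named = pd.read_csv(io.StringIO(raw), header=None, names=cols)
print(named.columns.tolist())

# rename() returns a new frame with only the listed headers changed
renamed = named.rename(columns={'flower_type': 'species'})
print(renamed.columns.tolist())
```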


In [6]:
#shape gives the dimensions of the DataFrame as (rows, columns).
print(df.shape)
#dtypes tells you the type of data each column stores
print(df.dtypes)

(150, 5)
sl             float64
sw             float64
pl             float64
pw             float64
flower_type     object
dtype: object


In [7]:
#describe() is a good way to get summary statistics for the data. By default it includes only the columns with numeric data.
#The results also exclude NaN entries
print(df.describe())
#To include all columns in the result, use include='all'
df.describe(include='all')


               sl          sw          pl          pw
count  150.000000  150.000000  150.000000  150.000000
mean     5.843333    3.054000    3.758667    1.198667
std      0.828066    0.433594    1.764420    0.763161
min      4.300000    2.000000    1.000000    0.100000
25%      5.100000    2.800000    1.600000    0.300000
50%      5.800000    3.000000    4.350000    1.300000
75%      6.400000    3.300000    5.100000    1.800000
max      7.900000    4.400000    6.900000    2.500000


                sl        sw        pl        pw  flower_type
count        150.0     150.0     150.0     150.0          150
unique         NaN       NaN       NaN       NaN            3
top            NaN       NaN       NaN       NaN  Iris-setosa
freq           NaN       NaN       NaN       NaN           50
mean      5.843333     3.054  3.758667  1.198667          NaN
std       0.828066  0.433594   1.76442  0.763161          NaN
min            4.3       2.0       1.0       0.1          NaN
25%            5.1       2.8       1.6       0.3          NaN
50%            5.8       3.0      4.35       1.3          NaN
75%            6.4       3.3       5.1       1.8          NaN
max            7.9       4.4       6.9       2.5          NaN
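
The effect of `include='all'` is easier to see on a few rows. The sketch below uses a hypothetical four-row frame `demo`: statistics that do not apply to a column, such as the mean of a text column, come back as NaN.

```python
import pandas as pd

# Hypothetical four-row frame with one numeric and one text column
demo = pd.DataFrame({
    "sl": [5.1, 4.9, 4.7, 6.3],
    "flower_type": ["Iris-setosa"] * 3 + ["Iris-virginica"],
})

numeric_only = demo.describe()              # numeric columns only
everything = demo.describe(include="all")   # every column

print(numeric_only.columns.tolist())                    # ['sl']
print(everything.columns.tolist())                      # ['sl', 'flower_type']
print(everything.loc["top", "flower_type"])             # Iris-setosa
print(pd.isna(everything.loc["mean", "flower_type"]))   # True
```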


In [8]:
#Accessing a particular column.
#df.column_name or df['column_name'] lets you access a particular column
df.sl  #OR
df['sl']
#You can also call describe() on a particular column
df.sl.describe()
#OR
df['sl'].describe()

count    150.000000
mean       5.843333
std        0.828066
min        4.300000
25%        5.100000
50%        5.800000
75%        6.400000
max        7.900000
Name: sl, dtype: float64
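
The two access styles are not always interchangeable. In the sketch below, the hypothetical frame `demo` has a column named `count`, which clashes with the `DataFrame.count` method, and a column whose name contains a space. Bracket access works in both cases, while attribute access does not.

```python
import pandas as pd

# Hypothetical frame: "count" clashes with a DataFrame method,
# and "petal length" is not a valid Python attribute name
demo = pd.DataFrame({
    "sl": [5.1, 4.9],
    "count": [1, 2],
    "petal length": [1.4, 1.4],
})

print(demo.sl.equals(demo["sl"]))       # True: both styles give the same column
print(callable(demo.count))             # True: this is the method, not the column
print(demo["count"].tolist())           # [1, 2]
print(demo["petal length"].tolist())    # [1.4, 1.4]
```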

In [9]:
#Null entries
#df.isnull() returns a boolean DataFrame that is True wherever an entry is null
print(df.isnull())
#df.isnull() on its own is rarely useful, but df.isnull().sum() gives the count of null entries in every column.
df.isnull().sum()

        sl     sw     pl     pw  flower_type
0    False  False  False  False        False
1    False  False  False  False        False
2    False  False  False  False        False
3    False  False  False  False        False
4    False  False  False  False        False
..     ...    ...    ...    ...          ...
145  False  False  False  False        False
146  False  False  False  False        False
147  False  False  False  False        False
148  False  False  False  False        False
149  False  False  False  False        False

[150 rows x 5 columns]


sl             0
sw             0
pl             0
pw             0
flower_type    0
dtype: int64
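
The iris data has no missing values, so every count above is zero. To see non-zero counts, here is a sketch on a hypothetical frame `demo` with a few entries deliberately left empty.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing entries in both columns
demo = pd.DataFrame({
    "sl": [5.1, np.nan, 4.7],
    "flower_type": ["Iris-setosa", None, None],
})

counts = demo.isnull().sum()       # null count per column
print(counts["sl"])                # 1
print(counts["flower_type"])       # 2
print(demo.isnull().sum().sum())   # 3: total nulls in the whole frame
```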

In [10]:
#Selecting rows and/or columns by position
#iloc lets us do this. Positions are zero-based and the end of a slice is excluded.
#Selecting the first 4 rows of data
df.iloc[:4,:]
#Selecting the first 4 rows and the first 2 columns
df.iloc[:4,:2]
#Selecting rows at positions 2 to 5 and columns at positions 1 to 3
df.iloc[2:6,1:4]

    sw   pl   pw
2  3.2  1.3  0.2
3  3.1  1.5  0.2
4  3.6  1.4  0.2
5  3.9  1.7  0.4
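
The slice bounds in `iloc` follow normal Python slicing: the start is included and the end is excluded. The sketch below confirms this on a hypothetical 6x4 frame `demo` filled with the numbers 0 to 23.

```python
import numpy as np
import pandas as pd

# Hypothetical 6x4 frame filled with 0..23, so positions are easy to trace
demo = pd.DataFrame(np.arange(24).reshape(6, 4), columns=["a", "b", "c", "d"])

sub = demo.iloc[2:6, 1:4]     # rows at positions 2-5, columns at positions 1-3

print(sub.shape)              # (4, 3)
print(sub.index.tolist())     # [2, 3, 4, 5]
print(sub.columns.tolist())   # ['b', 'c', 'd']
print(demo.iloc[0, 0])        # 0: a single cell by row and column position
```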
