# Introduction to pandas

pandas is a Python data science library for working with tabular data. It provides advanced data aggregation and statistical functions.

# Basic data structures

A one-dimensional labeled array (a vector) is called a <strong>Series</strong>. 
A two-dimensional labeled table (a matrix with named columns) is called a <strong>DataFrame</strong>.

We want a one-dimensional representation of our variable <strong>data</strong>.

In [2]:
import pandas as pd

In [3]:
data = [100, 200, 300, 400]

In [4]:
data_counts = pd.Series(data, name='count')

In [5]:
data_counts

0    100
1    200
2    300
3    400
Name: count, dtype: int64

The output shows each value next to its index label. The index can be changed, for example to other integers or to a range of dates.

In [6]:
data_counts.index #by default the index is a range from 0 up to (but not including) 4

RangeIndex(start=0, stop=4, step=1)

In [7]:
data_counts.index = [4, 5, 6, 7]

In [8]:
data_counts #index changes

4    100
5    200
6    300
7    400
Name: count, dtype: int64
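The text mentions that the index can also be a range of dates. A minimal sketch of that idea follows; the start date and the variable name `counts` are arbitrary choices for illustration, not part of the original notebook.

```python
import pandas as pd

# Same values as the example above; 'counts' is a hypothetical name.
counts = pd.Series([100, 200, 300, 400], name='count')

# Replace the default integer index with four consecutive days.
# The start date is an arbitrary assumption for this sketch.
counts.index = pd.date_range(start='2021-01-01', periods=len(counts), freq='D')
print(counts)

# Label-based lookup now takes a date instead of an integer.
second_day = counts.loc[pd.Timestamp('2021-01-02')]
print(second_day)
```

With a date index, lookups and slices can be written in terms of dates, which is usually what you want for time series data.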

In [9]:
data_counts.dtypes #the data type of the values (64-bit integers here)

dtype('int64')

Next, we convert the integer data to floating point. We import NumPy first so that we can refer to its float type.

In [10]:
import numpy as np

In [11]:
data_counts = data_counts.astype(np.float64) #np.float was removed in NumPy 1.24; use np.float64 or the built-in float

In [12]:
data_counts

4    100.0
5    200.0
6    300.0
7    400.0
Name: count, dtype: float64
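The conversion does not strictly need NumPy. As a small sketch (the names `counts`, `as_float` and `as_float64` are illustrative), `astype` also accepts the built-in `float` or a dtype string:

```python
import pandas as pd

counts = pd.Series([100, 200, 300, 400], name='count')

# Built-in float maps to 64-bit floating point.
as_float = counts.astype(float)

# A dtype string gives the same result.
as_float64 = counts.astype('float64')

print(as_float.dtype)
print(as_float.equals(as_float64))
```

This matters on recent NumPy versions, where the old `np.float` alias no longer exists.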

To get part of the Series, slice it by position:

In [13]:
data_counts.iloc[0:2] #position-based slicing, like a list: positions 0 and 1

4    100.0
5    200.0
Name: count, dtype: float64
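Because the index here is [4, 5, 6, 7], positions and labels no longer coincide. The following sketch contrasts the two kinds of lookup; the variable names are illustrative only.

```python
import pandas as pd

s = pd.Series([100.0, 200.0, 300.0, 400.0], index=[4, 5, 6, 7], name='count')

# iloc works by position: positions 0 and 1, end position excluded.
first_two = s.iloc[0:2]

# loc works by label: labels 4 through 5, end label included.
by_label = s.loc[4:5]

# A single label returns a single value.
third_value = s.loc[6]

print(first_two)
print(by_label)
print(third_value)
```

Using `iloc` and `loc` explicitly avoids having to remember which interpretation plain square brackets pick in a given situation.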

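Note that a DataFrame can be built from dictionaries, lists, pandas Series, and other inputs.

To build a DataFrame (two-dimensional), we pair each value with a label and pass the resulting list of tuples to pd.DataFrame.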

In [14]:
data_counts = list(zip(data_counts[0:3], ['car 1', 'car 2', 'car 3']))

In [15]:
data_counts

[(100.0, 'car 1'), (200.0, 'car 2'), (300.0, 'car 3')]

In [16]:
df = pd.DataFrame(data_counts) #to get the dataframe

In [17]:
print(df)

       0      1
0  100.0  car 1
1  200.0  car 2
2  300.0  car 3
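The DataFrame above has the default column names 0 and 1. As a sketch of the other construction routes mentioned earlier, the same table can be built with named columns from a dictionary or from the list of tuples; the column names `count` and `car` are assumptions made for this example.

```python
import pandas as pd

# From a dictionary: keys become column names.
cars = pd.DataFrame({
    'count': [100.0, 200.0, 300.0],
    'car': ['car 1', 'car 2', 'car 3'],
})

# From a list of tuples, naming the columns explicitly.
records = [(100.0, 'car 1'), (200.0, 'car 2'), (300.0, 'car 3')]
from_records = pd.DataFrame(records, columns=['count', 'car'])

print(cars)
print(cars.equals(from_records))
```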


## Pandas DataFrame

To read a CSV file into a DataFrame:

In [18]:
#Location of data file.
path = 'Data/Ames_Housing_Sales.csv'

In [19]:
#read the data
data = pd.read_csv(path)

In [20]:
data.head(5) #get the first 5 rows of the data.

|  | 1stFlrSF | 2ndFlrSF | 3SsnPorch | Alley | BedroomAbvGr | BldgType | BsmtCond | BsmtExposure | BsmtFinSF1 | BsmtFinSF2 | ... | ScreenPorch | Street | TotRmsAbvGrd | TotalBsmtSF | Utilities | WoodDeckSF | YearBuilt | YearRemodAdd | YrSold | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 856.0 | 854.0 | 0.0 |  | 3 | 1Fam | TA | No | 706.0 | 0.0 | ... | 0.0 | Pave | 8 | 856.0 | AllPub | 0.0 | 2003 | 2003 | 2008 | 208500.0 |
| 1 | 1262.0 | 0.0 | 0.0 |  | 3 | 1Fam | TA | Gd | 978.0 | 0.0 | ... | 0.0 | Pave | 6 | 1262.0 | AllPub | 298.0 | 1976 | 1976 | 2007 | 181500.0 |
| 2 | 920.0 | 866.0 | 0.0 |  | 3 | 1Fam | TA | Mn | 486.0 | 0.0 | ... | 0.0 | Pave | 6 | 920.0 | AllPub | 0.0 | 2001 | 2002 | 2008 | 223500.0 |
| 3 | 961.0 | 756.0 | 0.0 |  | 3 | 1Fam | Gd | No | 216.0 | 0.0 | ... | 0.0 | Pave | 7 | 756.0 | AllPub | 0.0 | 1915 | 1970 | 2006 | 140000.0 |
| 4 | 1145.0 | 1053.0 | 0.0 |  | 4 | 1Fam | TA | Av | 655.0 | 0.0 | ... | 0.0 | Pave | 9 | 1145.0 | AllPub | 192.0 | 2000 | 2000 | 2008 | 250000.0 |
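`read_csv` also accepts options that limit what is loaded. The sketch below reads from an in-memory string rather than the real file so that it runs anywhere; the three rows are copied from the output above, and the name `houses` is illustrative.

```python
import io
import pandas as pd

# A tiny stand-in for the CSV file, using three columns from the housing data.
csv_text = (
    "1stFlrSF,YrSold,SalePrice\n"
    "856.0,2008,208500.0\n"
    "1262.0,2007,181500.0\n"
    "920.0,2008,223500.0\n"
)

# usecols keeps only the listed columns; nrows stops after that many data rows.
houses = pd.read_csv(
    io.StringIO(csv_text),
    usecols=['1stFlrSF', 'SalePrice'],
    nrows=2,
)
print(houses)
```

For a large file, selecting columns and rows at read time can save both memory and time.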


We can set or change column names by passing a list of the names we want as <strong>columns</strong>.

In [21]:
ac_df = pd.DataFrame([[1, 2], [1, 3]], columns=['no 1', 'no 2'])

In [22]:
ac_df

|  | no 1 | no 2 |
|---|---|---|
| 0 | 1 | 2 |
| 1 | 1 | 3 |
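Column names can also be changed after the DataFrame exists. This is a brief sketch; the new names `first`, `second` and `other` are made up for illustration.

```python
import pandas as pd

ac_df = pd.DataFrame([[1, 2], [1, 3]], columns=['no 1', 'no 2'])

# Assigning a list to .columns replaces every name at once.
ac_df.columns = ['first', 'second']

# rename changes only the names listed and returns a new DataFrame.
renamed = ac_df.rename(columns={'second': 'other'})

print(list(ac_df.columns))
print(list(renamed.columns))
```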


To select a row by position, use iloc:

In [23]:
df.iloc[-2] #the second-to-last row (position -2)

0      200
1    car 2
Name: 1, dtype: object
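Beyond a single position, rows can be selected by a slice or by a condition. The following sketch rebuilds the small car table with assumed column names (`count`, `car`); the threshold of 150 is arbitrary.

```python
import pandas as pd

cars = pd.DataFrame(
    [(100.0, 'car 1'), (200.0, 'car 2'), (300.0, 'car 3')],
    columns=['count', 'car'],
)

# One row by position; the result is a Series.
second_last = cars.iloc[-2]

# Several rows by a positional slice; the result is a DataFrame.
first_two = cars.iloc[0:2]

# Rows that satisfy a condition (boolean mask).
large = cars[cars['count'] > 150]

print(second_last)
print(first_two)
print(large)
```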

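To select a column, we use its name.

First, build a DataFrame from <strong>data</strong>. It is already a DataFrame after read_csv, so this step just makes <strong>df</strong> hold the housing data instead of the car example.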

In [24]:
df = pd.DataFrame(data)

In [25]:
df.head()

|  | 1stFlrSF | 2ndFlrSF | 3SsnPorch | Alley | BedroomAbvGr | BldgType | BsmtCond | BsmtExposure | BsmtFinSF1 | BsmtFinSF2 | ... | ScreenPorch | Street | TotRmsAbvGrd | TotalBsmtSF | Utilities | WoodDeckSF | YearBuilt | YearRemodAdd | YrSold | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 856.0 | 854.0 | 0.0 |  | 3 | 1Fam | TA | No | 706.0 | 0.0 | ... | 0.0 | Pave | 8 | 856.0 | AllPub | 0.0 | 2003 | 2003 | 2008 | 208500.0 |
| 1 | 1262.0 | 0.0 | 0.0 |  | 3 | 1Fam | TA | Gd | 978.0 | 0.0 | ... | 0.0 | Pave | 6 | 1262.0 | AllPub | 298.0 | 1976 | 1976 | 2007 | 181500.0 |
| 2 | 920.0 | 866.0 | 0.0 |  | 3 | 1Fam | TA | Mn | 486.0 | 0.0 | ... | 0.0 | Pave | 6 | 920.0 | AllPub | 0.0 | 2001 | 2002 | 2008 | 223500.0 |
| 3 | 961.0 | 756.0 | 0.0 |  | 3 | 1Fam | Gd | No | 216.0 | 0.0 | ... | 0.0 | Pave | 7 | 756.0 | AllPub | 0.0 | 1915 | 1970 | 2006 | 140000.0 |
| 4 | 1145.0 | 1053.0 | 0.0 |  | 4 | 1Fam | TA | Av | 655.0 | 0.0 | ... | 0.0 | Pave | 9 | 1145.0 | AllPub | 192.0 | 2000 | 2000 | 2008 | 250000.0 |


In [26]:
#to get a specific column
df['1stFlrSF'][0:3]

0     856.0
1    1262.0
2     920.0
Name: 1stFlrSF, dtype: float64

In [27]:
df.Alley[0:5] #attribute-style access; works only when the column name is a valid Python identifier

0    None
1    None
2    None
3    None
4    None
Name: Alley, dtype: object

We can also index the DataFrame by position:


In [28]:
df.iloc[:, 1][0:5] #first 5 values of the column at position 1

0     854.0
1       0.0
2     866.0
3     756.0
4    1053.0
Name: 2ndFlrSF, dtype: float64
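Rows and columns can be selected in a single step. The sketch below uses a three-row stand-in built from values shown earlier; the name `houses` is illustrative.

```python
import pandas as pd

houses = pd.DataFrame({
    '1stFlrSF': [856.0, 1262.0, 920.0],
    '2ndFlrSF': [854.0, 0.0, 866.0],
    'SalePrice': [208500.0, 181500.0, 223500.0],
})

# A list of names selects several columns at once.
subset = houses[['1stFlrSF', 'SalePrice']]

# iloc takes row positions first, then column positions.
block = houses.iloc[0:2, 0:2]

# loc takes a row label and a column name.
value = houses.loc[1, 'SalePrice']

print(subset)
print(block)
print(value)
```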

There are many functions that can be applied to a DataFrame to get more information.

Learn more in the <a href='http://pandas.pydata.org/pandas-docs/stable/'>pandas documentation</a>.

To concatenate DataFrames:

In [29]:
new_data = pd.concat([df, data]) #stacks the rows of both frames, so every house appears twice

In [30]:
new_data.head() #to see the result

|  | 1stFlrSF | 2ndFlrSF | 3SsnPorch | Alley | BedroomAbvGr | BldgType | BsmtCond | BsmtExposure | BsmtFinSF1 | BsmtFinSF2 | ... | ScreenPorch | Street | TotRmsAbvGrd | TotalBsmtSF | Utilities | WoodDeckSF | YearBuilt | YearRemodAdd | YrSold | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 856.0 | 854.0 | 0.0 |  | 3 | 1Fam | TA | No | 706.0 | 0.0 | ... | 0.0 | Pave | 8 | 856.0 | AllPub | 0.0 | 2003 | 2003 | 2008 | 208500.0 |
| 1 | 1262.0 | 0.0 | 0.0 |  | 3 | 1Fam | TA | Gd | 978.0 | 0.0 | ... | 0.0 | Pave | 6 | 1262.0 | AllPub | 298.0 | 1976 | 1976 | 2007 | 181500.0 |
| 2 | 920.0 | 866.0 | 0.0 |  | 3 | 1Fam | TA | Mn | 486.0 | 0.0 | ... | 0.0 | Pave | 6 | 920.0 | AllPub | 0.0 | 2001 | 2002 | 2008 | 223500.0 |
| 3 | 961.0 | 756.0 | 0.0 |  | 3 | 1Fam | Gd | No | 216.0 | 0.0 | ... | 0.0 | Pave | 7 | 756.0 | AllPub | 0.0 | 1915 | 1970 | 2006 | 140000.0 |
| 4 | 1145.0 | 1053.0 | 0.0 |  | 4 | 1Fam | TA | Av | 655.0 | 0.0 | ... | 0.0 | Pave | 9 | 1145.0 | AllPub | 192.0 | 2000 | 2000 | 2008 | 250000.0 |
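Concatenating keeps the original index labels, so they repeat in the result. A small sketch with two made-up frames (`a` and `b` are hypothetical) shows this, along with two common variations:

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3, 4]})

# Default: rows are stacked and each frame keeps its own index labels.
stacked = pd.concat([a, b])

# ignore_index=True renumbers the rows from 0.
renumbered = pd.concat([a, b], ignore_index=True)

# axis=1 places the frames side by side instead of stacking them.
side_by_side = pd.concat([a, b.rename(columns={'x': 'y'})], axis=1)

print(stacked.index.tolist())
print(renumbered.index.tolist())
print(side_by_side)
```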


## Aggregated statistics (group by)

In [33]:
group_sizes = new_data.groupby('BsmtExposure').size() #number of rows in each BsmtExposure category (new_data holds every house twice)
print(group_sizes)

BsmtExposure
Av       338
Gd       234
Mn       170
No      1164
None     852
dtype: int64
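Grouping is not limited to counting rows. As a sketch, the example below computes a count and a mean per group on a five-row stand-in that uses the BsmtExposure and SalePrice values from the rows shown earlier; the names `houses` and `summary` are illustrative.

```python
import pandas as pd

houses = pd.DataFrame({
    'BsmtExposure': ['No', 'Gd', 'Mn', 'No', 'Av'],
    'SalePrice': [208500.0, 181500.0, 223500.0, 140000.0, 250000.0],
})

# Pick one column from each group and apply several aggregations to it.
summary = houses.groupby('BsmtExposure')['SalePrice'].agg(['count', 'mean'])
print(summary)
```

Each row of `summary` describes one category, with one column per aggregation.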


# Statistical calculations

In [41]:
df.mean(numeric_only=True) #column means; numeric_only=True skips text columns, which raise an error in pandas 2.0 and later

1stFlrSF           1177.129804
2ndFlrSF            353.424946
3SsnPorch             3.609862
BedroomAbvGr          2.865120
BsmtFinSF1          455.578680
BsmtFinSF2           48.102248
BsmtFullBath          0.430747
BsmtHalfBath          0.058738
BsmtUnfSF           570.765047
EnclosedPorch        21.039159
Fireplaces            0.641769
FullBath              1.580131
GarageArea          500.762146
GarageCars            1.870921
GarageYrBlt        1978.506164
GrLivArea          1534.689630
HalfBath              0.395939
KitchenAbvGr          1.038434
LotArea           10695.812183
LotFrontage          75.112314
LowQualFinSF          4.134880
MSSubClass           56.022480
MasVnrArea          108.364757
MiscVal              42.889050
MoSold                6.334300
OpenPorchSF          47.276287
OverallCond           5.577955
OverallQual           6.187092
PoolArea              2.920957
ScreenPorch          15.945613
TotRmsAbvGrd          6.552574
TotalBsmtSF        1074.445975
WoodDeck

In [42]:
df.describe() #summary statistics (count, mean, std, min, quartiles, max) for each numeric column

|  | 1stFlrSF | 2ndFlrSF | 3SsnPorch | BedroomAbvGr | BsmtFinSF1 | BsmtFinSF2 | BsmtFullBath | BsmtHalfBath | BsmtUnfSF | EnclosedPorch | ... | OverallQual | PoolArea | ScreenPorch | TotRmsAbvGrd | TotalBsmtSF | WoodDeckSF | YearBuilt | YearRemodAdd | YrSold | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1379.0 | 1379.0 | 1379.0 | 1379.0 | 1379.0 | 1379.0 | 1379.0 | 1379.0 | 1379.0 | 1379.0 | ... | 1379.0 | 1379.0 | 1379.0 | 1379.0 | 1379.0 | 1379.0 | 1379.0 | 1379.0 | 1379.0 | 1379.0 |
| mean | 1177.129804 | 353.424946 | 3.609862 | 2.86512 | 455.57868 | 48.102248 | 0.430747 | 0.058738 | 570.765047 | 21.039159 | ... | 6.187092 | 2.920957 | 15.945613 | 6.552574 | 1074.445975 | 97.456853 | 1972.958666 | 1985.435098 | 2007.812183 | 185479.51124 |
| std | 387.014961 | 439.553171 | 30.154682 | 0.783961 | 459.691379 | 164.324665 | 0.514052 | 0.238285 | 443.677845 | 60.535107 | ... | 1.34578 | 41.335545 | 57.249593 | 1.589821 | 436.371874 | 126.699192 | 29.379883 | 20.444852 | 1.330221 | 79023.8906 |
| min | 438.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 2.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 1880.0 | 1950.0 | 2006.0 | 35311.0 |
| 25% | 894.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 228.0 | 0.0 | ... | 5.0 | 0.0 | 0.0 | 5.0 | 810.0 | 0.0 | 1955.0 | 1968.0 | 2007.0 | 134000.0 |
| 50% | 1098.0 | 0.0 | 0.0 | 3.0 | 400.0 | 0.0 | 0.0 | 0.0 | 476.0 | 0.0 | ... | 6.0 | 0.0 | 0.0 | 6.0 | 1008.0 | 0.0 | 1976.0 | 1994.0 | 2008.0 | 167500.0 |
| 75% | 1414.0 | 738.5 | 0.0 | 3.0 | 732.0 | 0.0 | 1.0 | 0.0 | 811.0 | 0.0 | ... | 7.0 | 0.0 | 0.0 | 7.0 | 1314.0 | 171.0 | 2001.0 | 2004.0 | 2009.0 | 217750.0 |
| max | 4692.0 | 2065.0 | 508.0 | 6.0 | 5644.0 | 1474.0 | 2.0 | 2.0 | 2336.0 | 552.0 | ... | 10.0 | 738.0 | 480.0 | 12.0 | 6110.0 | 857.0 | 2010.0 | 2010.0 | 2010.0 | 755000.0 |
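Statistics such as the mean are only defined for numeric columns, so on a frame that mixes text and numbers it helps to restrict the calculation. This sketch uses a five-row stand-in with values from the rows shown earlier; `houses`, `means` and `medians` are illustrative names.

```python
import pandas as pd

houses = pd.DataFrame({
    'BldgType': ['1Fam', '1Fam', '1Fam', '1Fam', '1Fam'],
    'YearBuilt': [2003, 1976, 2001, 1915, 2000],
    'SalePrice': [208500.0, 181500.0, 223500.0, 140000.0, 250000.0],
})

# Option 1: keep only the numeric columns, then aggregate.
means = houses.select_dtypes(include='number').mean()

# Option 2: ask the aggregation itself to skip non-numeric columns.
medians = houses.median(numeric_only=True)

print(means)
print(medians)
```

Both routes leave the text column `BldgType` out of the result.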


# Sampling data

In [43]:
# Sample 5 rows without replacement
sample = data.sample(n=5, replace=False, random_state=42)
print(sample.iloc[:, -3:]) #show only the last three columns


     YearRemodAdd  YrSold  SalePrice
599          2004    2008   274000.0
881          1965    2009   117500.0
634          1950    2006    87000.0
425          1997    2007   204000.0
906          2003    2007   185000.0
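The `sample` method has a few other useful modes. The sketch below runs on a made-up ten-row frame; the seeds and sizes are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({'value': range(10)})

# The same random_state gives the same sample, which makes results reproducible.
a = df.sample(n=3, random_state=42)
b = df.sample(n=3, random_state=42)

# frac samples a fraction of the rows instead of a fixed number.
half = df.sample(frac=0.5, random_state=0)

# replace=True allows a row to be drawn more than once,
# so the sample may be larger than the original frame.
boot = df.sample(n=20, replace=True, random_state=0)

print(a.equals(b))
print(len(half), len(boot))
```

Sampling with replacement is the building block of bootstrap methods, while sampling without replacement is typical when drawing a subset for inspection.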


These are just the basics. For more statistical calculations, you can also use the <a href="https://docs.scipy.org/doc/scipy/reference/api.html"><strong>SciPy library</strong></a>.