# Introduction to pandas

pandas is a Python data science library for working with tabular data. It provides advanced data aggregation and statistical functions.

# Basic data structures

A vector (one dimension) is called a <strong>Series</strong>. 
A two-dimensional array (matrix) is called a <strong>DataFrame</strong>.

We want a one-dimensional representation of our variable <strong>data</strong>.

In [3]:
import pandas as pd

In [4]:
data = [100, 200, 300, 400]

In [6]:
data_counts = pd.Series(data, name='count')

In [7]:
data_counts

0    100
1    200
2    300
3    400
Name: count, dtype: int64

The output shows each value alongside its index. The index can be changed, for example to other integers or to a date range.

In [11]:
data_counts.index  # the default index is a range from 0 up to (but not including) 4

RangeIndex(start=0, stop=4, step=1)

In [12]:
data_counts.index = [4, 5, 6, 7]

In [14]:
data_counts  # the index has changed

4    100
5    200
6    300
7    400
Name: count, dtype: int64
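
The text above mentions date ranges as a possible index. A minimal sketch of that idea follows, using `pd.date_range`; the start date and the daily frequency are assumptions chosen only for illustration.

```python
import pandas as pd

# Same values as in the example above.
data_counts = pd.Series([100, 200, 300, 400], name='count')

# Assumed start date and daily frequency, picked only for illustration.
data_counts.index = pd.date_range(start='2021-01-01',
                                  periods=len(data_counts),
                                  freq='D')

print(data_counts)
print(data_counts.index)
```

The number of dates must match the number of values, which is why `periods` is set from `len(data_counts)`.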

In [16]:
data_counts.dtype  # the data type of the values (integers here)

dtype('int64')

Next, convert the integer data type to float. We first import NumPy to use its float type.

In [18]:
import numpy as np

In [19]:
data_counts = data_counts.astype(np.float64)  # np.float was removed in NumPy 1.24; np.float64 is the explicit type

In [20]:
data_counts

4    100.0
5    200.0
6    300.0
7    400.0
Name: count, dtype: float64
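
The conversion above can also be written without NumPy. The sketch below shows two equivalent spellings that `astype` accepts; the variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

counts = pd.Series([100, 200, 300, 400], name='count')

# The built-in float type is mapped to float64.
as_float = counts.astype(float)

# The dtype can also be given as a string.
as_float64 = counts.astype('float64')

print(as_float.dtype)
print(as_float64.dtype)
```

Note that `astype` returns a new Series, so the original `counts` keeps its integer type unless the result is assigned back.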

To get part of the Series, slice it:

In [21]:
data_counts[0:2]  # a slice selects by position, as with a list

4    100.0
5    200.0
Name: count, dtype: float64
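
Because the index here holds the integers 4 to 7, it helps to be explicit about whether a selection is by position or by label. The following is a small sketch of the two accessors pandas provides for that, `iloc` and `loc`; the variable names are illustrative.

```python
import pandas as pd

data_counts = pd.Series([100.0, 200.0, 300.0, 400.0],
                        index=[4, 5, 6, 7], name='count')

# Position-based: positions 0 and 1, the end position is excluded.
by_position = data_counts.iloc[0:2]

# Label-based: labels 4 through 5, the end label is included.
by_label = data_counts.loc[4:5]

print(by_position)
print(by_label)
```

Both selections return the same two rows here, but they reach them in different ways: `iloc` counts positions, while `loc` looks up index labels.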

Note that a DataFrame can be built from dictionaries, lists, pandas Series, and so on.

To build a DataFrame (two-dimensional):

In [22]:
data_counts = list(zip(data_counts[0:3], ['car 1', 'car 2', 'car 3']))

In [23]:
data_counts

[(100.0, 'car 1'), (200.0, 'car 2'), (300.0, 'car 3')]

In [26]:
df = pd.DataFrame(data_counts)  # build the DataFrame from the list of tuples

In [27]:
print(df)

       0      1
0  100.0  car 1
1  200.0  car 2
2  300.0  car 3
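
As noted earlier, a DataFrame can be built from several kinds of input. The sketch below builds the same small table from a dictionary, from a list of tuples, and from a Series; the column names `count` and `car` are assumptions made for illustration, since the original example leaves the columns unnamed.

```python
import pandas as pd

# From a dictionary: keys become column names.
from_dict = pd.DataFrame({'count': [100.0, 200.0, 300.0],
                          'car': ['car 1', 'car 2', 'car 3']})

# From a list of tuples, naming the columns explicitly.
rows = [(100.0, 'car 1'), (200.0, 'car 2'), (300.0, 'car 3')]
from_tuples = pd.DataFrame(rows, columns=['count', 'car'])

# From a Series: the Series name becomes the single column name.
counts = pd.Series([100.0, 200.0, 300.0], name='count')
from_series = counts.to_frame()

print(from_dict)
print(from_tuples)
print(from_series)
```

Naming the columns up front avoids the default integer column labels seen in the output above.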


## Pandas DataFrame

To read a data file into a DataFrame:

In [42]:
#Location of data file.
path = 'Data/Ames_Housing_Sales.csv'

In [43]:
#read the data
data = pd.read_csv(path)

In [45]:
data.head(5)  # get the first 5 rows of the data

Unnamed: 0,1stFlrSF,2ndFlrSF,3SsnPorch,Alley,BedroomAbvGr,BldgType,BsmtCond,BsmtExposure,BsmtFinSF1,BsmtFinSF2,...,ScreenPorch,Street,TotRmsAbvGrd,TotalBsmtSF,Utilities,WoodDeckSF,YearBuilt,YearRemodAdd,YrSold,SalePrice
0,856.0,854.0,0.0,,3,1Fam,TA,No,706.0,0.0,...,0.0,Pave,8,856.0,AllPub,0.0,2003,2003,2008,208500.0
1,1262.0,0.0,0.0,,3,1Fam,TA,Gd,978.0,0.0,...,0.0,Pave,6,1262.0,AllPub,298.0,1976,1976,2007,181500.0
2,920.0,866.0,0.0,,3,1Fam,TA,Mn,486.0,0.0,...,0.0,Pave,6,920.0,AllPub,0.0,2001,2002,2008,223500.0
3,961.0,756.0,0.0,,3,1Fam,Gd,No,216.0,0.0,...,0.0,Pave,7,756.0,AllPub,0.0,1915,1970,2006,140000.0
4,1145.0,1053.0,0.0,,4,1Fam,TA,Av,655.0,0.0,...,0.0,Pave,9,1145.0,AllPub,192.0,2000,2000,2008,250000.0
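
To try `read_csv` without the data file, the sketch below reads a small in-memory CSV instead. The text is a cut-down stand-in that reuses three column names and a few values from the listing above; it is not the actual Ames file.

```python
import io
import pandas as pd

# A tiny stand-in for the CSV file, kept in memory for illustration.
csv_text = """1stFlrSF,2ndFlrSF,SalePrice
856.0,854.0,208500.0
1262.0,0.0,181500.0
920.0,866.0,223500.0
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head(2))   # first 2 rows
print(sample.shape)     # (number of rows, number of columns)
print(sample.dtypes)    # data type of each column
```

Checking `shape` and `dtypes` right after loading is a quick way to confirm that the file was parsed as expected.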


We can set or change column names by setting columns to a list of the names we want.

In [50]:
ac_df = pd.DataFrame([[1,2], [1,3]],columns = ['no 1', 'no 2'])

In [51]:
ac_df

   no 1  no 2
0     1     2
1     1     3
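
Column names can also be changed after the DataFrame exists. The following sketch shows three common ways: assigning a full list to `columns`, renaming selected columns with `rename`, and adding a column by assignment. The new names (`first`, `second`, `other`, `total`) are made up for illustration.

```python
import pandas as pd

ac_df = pd.DataFrame([[1, 2], [1, 3]], columns=['no 1', 'no 2'])

# Replace all names at once; the list length must match the column count.
ac_df.columns = ['first', 'second']

# Rename only selected columns; rename returns a new DataFrame.
renamed = ac_df.rename(columns={'second': 'other'})

# Add a new column by assigning to a name that does not exist yet.
renamed['total'] = renamed['first'] + renamed['other']

print(ac_df)
print(renamed)
```

Assigning to `columns` changes the DataFrame in place, while `rename` leaves the original untouched unless the result is assigned back.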


To select a row, use iloc with the row position:

In [53]:
df.iloc[-2]  # the second-to-last row

0      200
1    car 2
Name: 1, dtype: object
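
Row selection comes in a few variants. The sketch below rebuilds the small cars table from earlier and selects a single row, a slice of rows, and a row by its index label; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame([(100.0, 'car 1'), (200.0, 'car 2'), (300.0, 'car 3')])

# A single row comes back as a Series indexed by the column labels.
second_to_last = df.iloc[-2]

# A slice of rows comes back as a DataFrame.
first_two = df.iloc[0:2]

# loc selects by index label rather than by position.
by_label = df.loc[1]

print(second_to_last)
print(first_two)
print(by_label)
```

With the default integer index, label 1 and position -2 point to the same row of this three-row table.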

For columns, we select by name.

First, get a DataFrame from data.

In [58]:
df = pd.DataFrame(data)

In [59]:
df.head()

Unnamed: 0,1stFlrSF,2ndFlrSF,3SsnPorch,Alley,BedroomAbvGr,BldgType,BsmtCond,BsmtExposure,BsmtFinSF1,BsmtFinSF2,...,ScreenPorch,Street,TotRmsAbvGrd,TotalBsmtSF,Utilities,WoodDeckSF,YearBuilt,YearRemodAdd,YrSold,SalePrice
0,856.0,854.0,0.0,,3,1Fam,TA,No,706.0,0.0,...,0.0,Pave,8,856.0,AllPub,0.0,2003,2003,2008,208500.0
1,1262.0,0.0,0.0,,3,1Fam,TA,Gd,978.0,0.0,...,0.0,Pave,6,1262.0,AllPub,298.0,1976,1976,2007,181500.0
2,920.0,866.0,0.0,,3,1Fam,TA,Mn,486.0,0.0,...,0.0,Pave,6,920.0,AllPub,0.0,2001,2002,2008,223500.0
3,961.0,756.0,0.0,,3,1Fam,Gd,No,216.0,0.0,...,0.0,Pave,7,756.0,AllPub,0.0,1915,1970,2006,140000.0
4,1145.0,1053.0,0.0,,4,1Fam,TA,Av,655.0,0.0,...,0.0,Pave,9,1145.0,AllPub,192.0,2000,2000,2008,250000.0


In [63]:
# get a specific column by name, then take its first 3 values
df['1stFlrSF'][0:3]

0     856.0
1    1262.0
2     920.0
Name: 1stFlrSF, dtype: float64

In [67]:
df.Alley[0:5]  # attribute-style access to a column

0    None
1    None
2    None
3    None
4    None
Name: Alley, dtype: object
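
The two access styles above are not fully interchangeable. The sketch below uses a small made-up frame that reuses three column names and a few values from the listing; it shows bracket access for one column, for several columns, and attribute access where the name allows it.

```python
import pandas as pd

# Small stand-in frame for illustration, not the full housing data.
df = pd.DataFrame({'1stFlrSF': [856.0, 1262.0, 920.0],
                   'BldgType': ['1Fam', '1Fam', '1Fam'],
                   'SalePrice': [208500.0, 181500.0, 223500.0]})

# One name in brackets returns a Series.
one_column = df['1stFlrSF']

# A list of names in brackets returns a DataFrame.
two_columns = df[['1stFlrSF', 'SalePrice']]

# Attribute access works only for names that are valid Python identifiers.
# '1stFlrSF' starts with a digit, so it needs the bracket form.
via_attribute = df.SalePrice

print(one_column)
print(two_columns)
print(via_attribute)
```

Bracket access works for every column name, which makes it the safer default.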

We can also index the DataFrame by position:


In [69]:
df.iloc[:, 1][0:5]  # the first 5 values of the second column

0     854.0
1       0.0
2     866.0
3     756.0
4    1053.0
Name: 2ndFlrSF, dtype: float64
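
The chained selection above can also be written as a single `iloc` call that takes rows and columns together. The sketch below uses a small stand-in frame built from values in the listing; the variable names are illustrative.

```python
import pandas as pd

# Small stand-in frame for illustration, not the full housing data.
df = pd.DataFrame({'1stFlrSF': [856.0, 1262.0, 920.0, 961.0, 1145.0],
                   '2ndFlrSF': [854.0, 0.0, 866.0, 756.0, 1053.0],
                   'SalePrice': [208500.0, 181500.0, 223500.0,
                                 140000.0, 250000.0]})

# Chained form: pick the column first, then slice the rows.
chained = df.iloc[:, 1][0:3]

# Single call: rows first, then columns.
single_call = df.iloc[0:3, 1]

# A block of rows and columns comes back as a DataFrame.
block = df.iloc[0:2, 0:2]

# Label-based equivalent; with loc the end label is included.
by_name = df.loc[0:2, '2ndFlrSF']

print(single_call)
print(block)
```

The single-call form is usually preferred because it selects in one step instead of creating an intermediate Series.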

There are many functions that can be applied to 