# Intro to Pandas

[Data Science Handbook (with notebooks!)](https://jakevdp.github.io/PythonDataScienceHandbook/)

[Basics of Pandas](https://towardsdatascience.com/6-basic-pandas-techniques-you-need-to-know-2c5725746938)

[Pandas cheat sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)

[Good overview of rows and columns](https://www.geeksforgeeks.org/dealing-with-rows-and-columns-in-pandas-dataframe/)

Pandas is a Python library for data manipulation. A Pandas *DataFrame* is a table with rows and columns. Typically, each row holds one data point and each column holds one feature of that data point.

In [1]:
import pandas as pd
from sklearn import datasets
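To make the row/column picture concrete, here is a minimal sketch with made-up data (the column names `height_cm` and `weight_kg` are hypothetical). Each row is one data point, each column one feature, and `shape` reports both counts.

```python
import pandas as pd

# Hypothetical toy data: three data points (rows), two features (columns).
people = pd.DataFrame({
    "height_cm": [170, 182, 165],
    "weight_kg": [65.0, 80.5, 58.2],
})

# shape is (number of rows, number of columns)
n_rows, n_cols = people.shape
print(people)
print(n_rows, n_cols)
```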

## Converting from format X to DataFrame

### List to DataFrame

In [2]:
num_list = [1,2,3,4,5]
df = pd.DataFrame(num_list)
df

Unnamed: 0,0
0,1
1,2
2,3
3,4
4,5


In [3]:
num_list = [(1,2),(3,4),(5,3)]
df = pd.DataFrame(num_list)
df

Unnamed: 0,0,1
0,1,2
1,3,4
2,5,3
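By default the columns are just numbered 0, 1, and so on. A small sketch of naming them at construction time with the `columns` argument (the names `x` and `y` are made up for illustration):

```python
import pandas as pd

num_list = [(1, 2), (3, 4), (5, 3)]

# "x" and "y" are illustrative column names, not part of the original data.
df_named = pd.DataFrame(num_list, columns=["x", "y"])
print(df_named)
```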


### Dictionary to DataFrame

In [4]:
data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
pd.DataFrame.from_dict(data)

Unnamed: 0,col_1,col_2
0,3,a
1,2,b
2,1,c
3,0,d
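In the cell above each dictionary key becomes a column. As a sketch of the alternative, `from_dict` also accepts `orient="index"`, which treats each key as a row label instead:

```python
import pandas as pd

data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}

# Default orientation: keys become columns (same result as pd.DataFrame(data)).
by_column = pd.DataFrame.from_dict(data)

# orient="index": keys become row labels, list items spread across columns.
by_row = pd.DataFrame.from_dict(data, orient="index")

print(by_column.shape)
print(by_row.shape)
```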


### Text file (with spaces) to DataFrame

In [5]:
df = pd.read_fwf('../datasets/xy.txt')
df

Unnamed: 0,x y
0,10.0 8.04
1,8.0 6.95
2,13.0 7.58
3,9.0 8.81
4,11.0 8.33
5,14.0 9.96
6,6.0 7.24
7,4.0 4.26
8,12.0 10.84
9,7.0 4.82
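Another option for whitespace-separated text is `read_csv` with a regular-expression separator. The sketch below assumes the file has a header line `x y` followed by space-separated numbers; an in-memory string with the first three rows stands in for `xy.txt`, whose exact contents are not shown here.

```python
import io
import pandas as pd

# Stand-in for the file: header plus the first three rows displayed above.
text = "x y\n10.0 8.04\n8.0 6.95\n13.0 7.58\n"

# sep=r"\s+" splits on any run of whitespace, giving separate x and y columns.
xy = pd.read_csv(io.StringIO(text), sep=r"\s+")
print(xy)
```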


### CSV to DataFrame

In [6]:
# Read a CSV file from the web
df = pd.read_csv('http://bit.ly/autompg-csv') # Miles per gallon
df.head()

Unnamed: 0,mpg,cyl,displ,hp,weight,accel,yr,origin,name
0,18.0,8,307.0,130,3504,12.0,70,US,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,70,US,buick skylark 320
2,18.0,8,318.0,150,3436,11.0,70,US,plymouth satellite
3,16.0,8,304.0,150,3433,12.0,70,US,amc rebel sst
4,17.0,8,302.0,140,3449,10.5,70,US,ford torino


In [7]:
df = pd.read_csv('../datasets/GDP-2015.csv')
df.head()

Unnamed: 0,Entity,Code,Year,GDP per capita
0,Afghanistan,AFG,2015,1928
1,Albania,ALB,2015,10947
2,Algeria,DZA,2015,13024
3,Angola,AGO,2015,8631
4,Argentina,ARG,2015,19316
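`read_csv` takes many optional arguments. As a sketch, `usecols` keeps only some columns and `index_col` uses one of them as row labels. The string below copies the first three rows shown above so the example runs without the file.

```python
import io
import pandas as pd

# First three rows of the GDP table above, as an in-memory CSV.
csv_text = """Entity,Code,Year,GDP per capita
Afghanistan,AFG,2015,1928
Albania,ALB,2015,10947
Algeria,DZA,2015,13024
"""

gdp_small = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["Entity", "GDP per capita"],  # read only these columns
    index_col="Entity",                    # use country names as row labels
)
print(gdp_small)
print(gdp_small.loc["Albania", "GDP per capita"])
```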


### Excel to DataFrame

In [8]:
# .xls worked here, but .xlsx did not (in recent pandas versions, reading .xlsx needs the openpyxl package)
df = pd.read_excel('../datasets/friends.xls')
df

Unnamed: 0,Name,Age,Gender
0,Siri,15,f
1,Laura,6,f
2,Oscar,5,m


### Scikit-learn datasets to DataFrame

In [9]:
iris = datasets.load_iris()
# load_iris() returns a dict-like Bunch object; uncomment to check:
# type(iris)

In [10]:
iris_features = pd.DataFrame(iris.data, columns = iris.feature_names)
iris_features.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In [11]:
iris_target = pd.DataFrame(iris.target)
iris_target.head()

Unnamed: 0,0
0,0
1,0
2,0
3,0
4,0
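The features and the target can also live in one DataFrame. A possible sketch: build the feature table as above, then add the target as a labelled column. The column name `species` is chosen for this example and is not part of the dataset.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Map the integer codes in iris.target to their names in iris.target_names.
# "species" is an illustrative column name.
iris_df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

print(iris_df.head())
print(iris_df["species"].value_counts())
```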


## Looking at DataFrames

In [12]:
df = pd.read_csv('../datasets/GDP-2015.csv')

In [13]:
df.head()  # first 5 rows

Unnamed: 0,Entity,Code,Year,GDP per capita
0,Afghanistan,AFG,2015,1928
1,Albania,ALB,2015,10947
2,Algeria,DZA,2015,13024
3,Angola,AGO,2015,8631
4,Argentina,ARG,2015,19316


In [14]:
df.tail(3)  # last 3 rows

Unnamed: 0,Entity,Code,Year,GDP per capita
164,Yemen,YEM,2015,2496
165,Zambia,ZMB,2015,3537
166,Zimbabwe,ZWE,2015,1759


In [15]:
df.describe()

Unnamed: 0,Year,GDP per capita
count,167.0,167.0
mean,2015.0,18216.598802
std,0.0,19305.364946
min,2015.0,605.0
25%,2015.0,3705.0
50%,2015.0,11738.0
75%,2015.0,25843.0
max,2015.0,139542.0
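Note that `describe()` summarizes only the numeric columns by default, which is why `Entity` and `Code` are missing above. A short sketch of two other quick checks, `shape` and `dtypes`, on a small frame rebuilt from the first five rows shown earlier:

```python
import pandas as pd

# First five rows of the GDP table, typed in by hand for a runnable example.
gdp_head = pd.DataFrame({
    "Entity": ["Afghanistan", "Albania", "Algeria", "Angola", "Argentina"],
    "Code": ["AFG", "ALB", "DZA", "AGO", "ARG"],
    "Year": [2015, 2015, 2015, 2015, 2015],
    "GDP per capita": [1928, 10947, 13024, 8631, 19316],
})

print(gdp_head.shape)   # (rows, columns)
print(gdp_head.dtypes)  # data type of each column

# describe() also works on a single column.
summary = gdp_head["GDP per capita"].describe()
print(summary)
```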


## Working with DataFrames

Grab a column:

In [16]:
df.columns

Index(['Entity', 'Code', 'Year', 'GDP per capita'], dtype='object')

In [17]:
countries = df['Entity']
countries

0      Afghanistan
1          Albania
2          Algeria
3           Angola
4        Argentina
          ...     
162      Venezuela
163        Vietnam
164          Yemen
165         Zambia
166       Zimbabwe
Name: Entity, Length: 167, dtype: object

In [18]:
gdp = df['GDP per capita']
gdp

0       1928
1      10947
2      13024
3       8631
4      19316
       ...  
162    16257
163     5733
164     2496
165     3537
166     1759
Name: GDP per capita, Length: 167, dtype: int64
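Single brackets with one name return a *Series*. To grab several columns at once, pass a list of names, which returns a DataFrame. A sketch on the first five rows of the table:

```python
import pandas as pd

# First five rows of the GDP table, typed in by hand for a runnable example.
gdp_head = pd.DataFrame({
    "Entity": ["Afghanistan", "Albania", "Algeria", "Angola", "Argentina"],
    "Code": ["AFG", "ALB", "DZA", "AGO", "ARG"],
    "Year": [2015, 2015, 2015, 2015, 2015],
    "GDP per capita": [1928, 10947, 13024, 8631, 19316],
})

one_column = gdp_head["Entity"]                        # Series
two_columns = gdp_head[["Entity", "GDP per capita"]]   # DataFrame

print(type(one_column).__name__)
print(type(two_columns).__name__)
```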

Grab an entry:

In [19]:
gdp = df['GDP per capita']
gdpAngola = gdp[3]  # row label 3 is Angola
gdpAngola

8631
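There are other ways to reach the same entry. A sketch of three of them: `.loc` (by row label and column name), `.iloc` (by integer position), and a boolean mask that looks the row up by country name rather than by number.

```python
import pandas as pd

# First five rows of the GDP table, typed in by hand for a runnable example.
df = pd.DataFrame({
    "Entity": ["Afghanistan", "Albania", "Algeria", "Angola", "Argentina"],
    "Code": ["AFG", "ALB", "DZA", "AGO", "ARG"],
    "Year": [2015, 2015, 2015, 2015, 2015],
    "GDP per capita": [1928, 10947, 13024, 8631, 19316],
})

by_label = df.loc[3, "GDP per capita"]   # row label 3, column by name
by_position = df.iloc[3, 3]              # fourth row, fourth column

# Look up by country name instead of row number.
mask = df["Entity"] == "Angola"
by_name = df.loc[mask, "GDP per capita"].iloc[0]

print(by_label, by_position, by_name)
```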

## Small example

In [20]:
friends = pd.read_excel('../datasets/friends.xls')
friends['Name']

0     Siri
1    Laura
2    Oscar
Name: Name, dtype: object

Add a column:

In [21]:
friends['HasBike'] = True
friends.head()

Unnamed: 0,Name,Age,Gender,HasBike
0,Siri,15,f,True
1,Laura,6,f,True
2,Oscar,5,m,True
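A new column can also be computed from an existing one. In this sketch the table is rebuilt by hand from the output above, and the column names `IsTeen` and `AgeNextYear` are made up for illustration.

```python
import pandas as pd

# The friends table from above, typed in by hand for a runnable example.
friends = pd.DataFrame({
    "Name": ["Siri", "Laura", "Oscar"],
    "Age": [15, 6, 5],
    "Gender": ["f", "f", "m"],
})

# Operations on a column apply to every row at once.
friends["IsTeen"] = friends["Age"] >= 13      # hypothetical column
friends["AgeNextYear"] = friends["Age"] + 1   # hypothetical column

print(friends)
```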


Save changes to a file:

In [22]:
friends.to_excel('../datasets/friends_extended.xls')
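Saving to CSV works the same way with `to_csv`. The sketch below writes to a temporary directory (so no real file is touched) and reads the result back. Passing `index=False` leaves out the row index; without it, the index is written as an extra first column, which shows up as `Unnamed: 0` when the file is read back.

```python
import os
import tempfile

import pandas as pd

# The extended friends table from above, typed in by hand.
friends = pd.DataFrame({
    "Name": ["Siri", "Laura", "Oscar"],
    "Age": [15, 6, 5],
    "Gender": ["f", "f", "m"],
    "HasBike": [True, True, True],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # The file name is illustrative; any path works.
    path = os.path.join(tmp_dir, "friends_extended.csv")
    friends.to_csv(path, index=False)  # do not write the row index
    reloaded = pd.read_csv(path)

print(reloaded)
```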