## CS 210 Spring 2024 - Mar 21
### Pandas

In [1]:
from pandas import Series
from pandas import DataFrame
import numpy as np

---

#### <font color="brown">Creating DataFrames - Continued</font>

**3. Creating a DataFrame from a 2D NumPy array**

In [2]:
rand2d = np.random.random((3,2))
randdf = DataFrame(rand2d)
randdf

Unnamed: 0,0,1
0,0.338492,0.344256
1,0.956191,0.345336
2,0.409539,0.84134


**Change index and column names**

In [3]:
randdf.index = ['one', 'two', 'three']
randdf.columns = ['first', 'second']
randdf

Unnamed: 0,first,second
one,0.338492,0.344256
two,0.956191,0.345336
three,0.409539,0.84134


**Or set them up at creation time**

In [4]:
randdf = DataFrame(rand2d, index=['one', 'two', 'three'],
                   columns = ['first', 'second'])
randdf

Unnamed: 0,first,second
one,0.338492,0.344256
two,0.956191,0.345336
three,0.409539,0.84134


---

#### <font color="brown">Columns</font>

**Membership**

In [5]:
popdat = {'state': ['Arizona','Arizona','Arizona','Virginia','Virginia'],
          'year': [2005, 2010, 2015, 2010, 2015],
          'pop': [5.9, 6.6, 6.8, 7.9, 8.3]}
popdf = DataFrame(popdat)
popdf

Unnamed: 0,state,year,pop
0,Arizona,2005,5.9
1,Arizona,2010,6.6
2,Arizona,2015,6.8
3,Virginia,2010,7.9
4,Virginia,2015,8.3


In [6]:
'debt' in popdf.columns  

False

**Each column is a Series**

**A column can be referenced by using its name as an index into the dataframe**

In [7]:
print(popdf['state'],'\n')
print(popdf['state'].name, '\n')
print(popdf['state'].values, '\n')
print(popdf['state'].index, '\n')

0     Arizona
1     Arizona
2     Arizona
3    Virginia
4    Virginia
Name: state, dtype: object 

state 

['Arizona' 'Arizona' 'Arizona' 'Virginia' 'Virginia'] 

RangeIndex(start=0, stop=5, step=1) 



**Alternatively, a column can be referenced as an attribute of the dataframe**

In [8]:
popdf.state

0     Arizona
1     Arizona
2     Arizona
3    Virginia
4    Virginia
Name: state, dtype: object
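
Attribute access is convenient, but it only reaches a column when the name is a valid Python identifier that does not clash with an existing DataFrame attribute or method. The sketch below uses a small stand-in frame (`demo` is a hypothetical name, not part of the lecture data) to show that a column called `pop` is shadowed by the `DataFrame.pop` method, while bracket access still works:

```python
from pandas import DataFrame

# Small stand-in for popdf (hypothetical demo data)
demo = DataFrame({'state': ['Arizona', 'Virginia'], 'pop': [5.9, 7.9]})

# 'state' does not clash with anything, so attribute access returns the column
print(demo.state.tolist())

# 'pop' clashes with the DataFrame.pop method, so demo.pop is a method, not the column
print(callable(demo.pop))

# Bracket access always reaches the column
print(demo['pop'].tolist())
```

When in doubt, the bracket form is the safer choice.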

**A subset of columns can be selected with a list of column names, similar to selecting rows of an ndarray or entries of a Series with a list of indexes**

In [9]:
popdf[['state','pop']]

Unnamed: 0,state,pop
0,Arizona,5.9
1,Arizona,6.6
2,Arizona,6.8
3,Virginia,7.9
4,Virginia,8.3


**Changing column names**

In [10]:
popdf

Unnamed: 0,state,year,pop
0,Arizona,2005,5.9
1,Arizona,2010,6.6
2,Arizona,2015,6.8
3,Virginia,2010,7.9
4,Virginia,2015,8.3


In [11]:
popdf.columns = ['year','state','pop']
popdf

Unnamed: 0,year,state,pop
0,Arizona,2005,5.9
1,Arizona,2010,6.6
2,Arizona,2015,6.8
3,Virginia,2010,7.9
4,Virginia,2015,8.3


<font color="red">**Warning: Changing column names assigns new names, does NOT rearrange!**</font>

In [12]:
# restore to original
popdf.columns = ['state','year','pop']
popdf

Unnamed: 0,state,year,pop
0,Arizona,2005,5.9
1,Arizona,2010,6.6
2,Arizona,2015,6.8
3,Virginia,2010,7.9
4,Virginia,2015,8.3
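
Since assigning to `columns` only relabels, one way to actually rearrange columns is to select them in the desired order, which returns a new DataFrame. A minimal sketch, using a small stand-in frame (`popdf_demo` and `reordered` are hypothetical names):

```python
from pandas import DataFrame

# Small stand-in for popdf (hypothetical demo data)
popdf_demo = DataFrame({'state': ['Arizona', 'Virginia'],
                        'year': [2005, 2010],
                        'pop': [5.9, 7.9]})

# Selecting with a list of names reorders the columns; each column keeps its own data
reordered = popdf_demo[['year', 'state', 'pop']]
print(reordered)

# The original frame keeps its column order
print(list(popdf_demo.columns))
```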


---

#### <font color="brown">Indexing and Manipulating rows and columns</font>

**Row indexing by label, using loc (here the labels are the default integer index 0, 1, 2, ...)**

In [13]:
popdf.loc[1]

state    Arizona
year        2010
pop          6.6
Name: 1, dtype: object

**Row of a DataFrame is a Series**

In [14]:
print(popdf.loc[1].name)
print(popdf.loc[1].values)

1
['Arizona' 2010 6.6]


**Range of rows**

In [15]:
popdf.loc[1:3]

Unnamed: 0,state,year,pop
1,Arizona,2010,6.6
2,Arizona,2015,6.8
3,Virginia,2010,7.9


**<font color="red">Note above, end value of range of rows is INCLUSIVE!</font>**

**Subset of rows, subset of columns**

In [16]:
popdf.loc[[0,2],['state','pop']]  

Unnamed: 0,state,pop
0,Arizona,5.9
2,Arizona,6.8


**Compare the above to the usage in cell 9:**

<pre>
   popdf[['state','pop']]
</pre>

**which gave the *columns* 'state' and 'pop' because we weren't using the loc feature**

**Adding a column**

In [17]:
# assign same value to all rows in the column
popdf['debt'] = 1.5
popdf

Unnamed: 0,state,year,pop,debt
0,Arizona,2005,5.9,1.5
1,Arizona,2010,6.6,1.5
2,Arizona,2015,6.8,1.5
3,Virginia,2010,7.9,1.5
4,Virginia,2015,8.3,1.5


In [18]:
# Assign different value for each row
popdf['debt'] = np.arange(1,6)
popdf

Unnamed: 0,state,year,pop,debt
0,Arizona,2005,5.9,1
1,Arizona,2010,6.6,2
2,Arizona,2015,6.8,3
3,Virginia,2010,7.9,4
4,Virginia,2015,8.3,5


In [19]:
popdat2 = {'Arizona': {2005: 5.9, 2010: 6.6, 2015: 6.8},
           'Virginia': {2010: 7.9, 2015: 8.3}}
popdf2 = DataFrame(popdat2)
popdf2

Unnamed: 0,Arizona,Virginia
2005,5.9,
2010,6.6,7.9
2015,6.8,8.3


In [20]:
# Different value for each row
popdf2['NJ'] = [8.2, 8.4, 8.6]
popdf2

Unnamed: 0,Arizona,Virginia,NJ
2005,5.9,,8.2
2010,6.6,7.9,8.4
2015,6.8,8.3,8.6


**What if the assigned Series has fewer values than the number of rows?**

In [21]:
debts = Series([1.2, 1.5, 1.7])
popdf['debt'] = debts
popdf

Unnamed: 0,state,year,pop,debt
0,Arizona,2005,5.9,1.2
1,Arizona,2010,6.6,1.5
2,Arizona,2015,6.8,1.7
3,Virginia,2010,7.9,
4,Virginia,2015,8.3,


**The Series is aligned with the dataframe on index labels, so rows with no matching label are filled with NaN**
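
The NaN filling comes from index alignment rather than from simple padding at the end. The sketch below, on a hypothetical three-row frame named `demo`, assigns a Series whose labels are 1 and 2, so row 0 is the one left empty. It also shows that a plain list, which has no labels to align on, must match the row count exactly:

```python
from pandas import DataFrame, Series

# Hypothetical demo frame with the default index 0, 1, 2
demo = DataFrame({'state': ['Arizona', 'Arizona', 'Virginia'],
                  'pop': [5.9, 6.6, 7.9]})

# A Series is aligned on index labels: only rows 1 and 2 receive values
demo['debt'] = Series([1.5, 1.7], index=[1, 2])
print(demo)

# A plain list has no labels, so a length mismatch raises ValueError
try:
    demo['debt2'] = [1.0, 2.0]
except ValueError as err:
    print('ValueError:', err)
```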

**Creating a new column with values as a function of the other columns**

In [22]:
rand2d = np.random.random((3,2))
randdf = DataFrame(rand2d)
randdf

Unnamed: 0,0,1
0,0.708237,0.771017
1,0.351976,0.353927
2,0.938481,0.671856


In [23]:
randdf.index = ['one', 'two', 'three']
randdf.columns = ['first', 'second']
randdf

Unnamed: 0,first,second
one,0.708237,0.771017
two,0.351976,0.353927
three,0.938481,0.671856


In [24]:
randdf['third'] = randdf['first'] > randdf['second']
randdf

Unnamed: 0,first,second,third
one,0.708237,0.771017,False
two,0.351976,0.353927,False
three,0.938481,0.671856,True


**Row (index) membership**

In [25]:
'three' in randdf.index

True

In [26]:
randdf['three']

KeyError: 'three'

**<font color="red">Above syntax of dataframe['name'] can only be used with column names</font>**

**Row indexing by labels, using loc**

In [None]:
randdf.loc['two']

first     0.720373
second    0.846598
third        False
Name: two, dtype: object

In [None]:
randdf.loc['two':'three']

Unnamed: 0,first,second,third
two,0.720373,0.846598,False
three,0.18992,0.664168,False


In [None]:
randdf.loc[1:2]  

TypeError: cannot do slice indexing on Index with these indexers [1] of type int

**<font color="red">Can't use numeric indexes with loc here because the dataframe is indexed by string labels</font>**
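
When a frame has string labels, positions are still available through `iloc`. A minimal sketch on a hypothetical frame `demo`, showing a position-based slice and a label-based slice that select the same rows:

```python
import numpy as np
from pandas import DataFrame

# Hypothetical demo frame indexed by string labels
demo = DataFrame(np.arange(6).reshape(3, 2),
                 index=['one', 'two', 'three'],
                 columns=['first', 'second'])

# Position-based slice: rows at positions 1 and 2 (end position excluded)
by_position = demo.iloc[1:3]

# Label-based slice: rows 'two' through 'three' (end label included)
by_label = demo.loc['two':'three']

print(by_position.equals(by_label))
```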

---

**Adding a row using loc**

In [None]:
popdf2

Unnamed: 0,Arizona,Virginia,NJ
2005,5.9,,8.2
2010,6.6,7.9,8.4
2015,6.8,8.3,8.6


In [None]:
popdf2.loc[2020] = [7.2, 8.6, 8.9]
popdf2

Unnamed: 0,Arizona,Virginia,NJ
2005,5.9,,8.2
2010,6.6,7.9,8.4
2015,6.8,8.3,8.6
2020,7.2,8.6,8.9


In [None]:
popdf2.index

Index([2005, 2010, 2015, 2020], dtype='int64')

In [None]:
popdf2.loc[[2010,2020]]

Unnamed: 0,Arizona,Virginia,NJ
2010,6.6,7.9,8.4
2020,7.2,8.6,8.9


**Deleting a column with del operation**

In [None]:
popdf

Unnamed: 0,state,year,pop,debt
0,Arizona,2005,5.9,1.2
1,Arizona,2010,6.6,1.5
2,Arizona,2015,6.8,1.7
3,Virginia,2010,7.9,
4,Virginia,2015,8.3,


In [None]:
del popdf['debt']
popdf

Unnamed: 0,state,year,pop
0,Arizona,2005,5.9
1,Arizona,2010,6.6
2,Arizona,2015,6.8
3,Virginia,2010,7.9
4,Virginia,2015,8.3


**The del operator changes the original dataframe**

In [None]:
randdf

Unnamed: 0,first,second,third
one,0.412379,0.155821,True
two,0.720373,0.846598,False
three,0.18992,0.664168,False


In [None]:
del randdf['second']
randdf

Unnamed: 0,first,third
one,0.412379,True
two,0.720373,False
three,0.18992,False


In [None]:
del randdf.third
randdf

AttributeError: 'DataFrame' object has no attribute 'third'

In [None]:
randdf.third

one       True
two      False
three    False
Name: third, dtype: bool

**You can refer to a column as a property (randdf.third instead of randdf\['third'\]), but you can't delete a column that way**
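
If the original frame should stay intact, an alternative to `del` is the `drop` method, which by default returns a new DataFrame. A minimal sketch on a hypothetical frame `demo`:

```python
from pandas import DataFrame

# Hypothetical demo frame
demo = DataFrame({'first': [0.4, 0.7],
                  'second': [0.1, 0.8],
                  'third': [True, False]})

# drop returns a new DataFrame without the column; demo itself is not modified
trimmed = demo.drop(columns=['third'])

print(list(trimmed.columns))
print(list(demo.columns))
```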

---

#### <font color="brown">Indexing a DataFrame with iloc (using integer indices)</font>

In [None]:
popdf

Unnamed: 0,state,year,pop
0,Arizona,2005,5.9
1,Arizona,2010,6.6
2,Arizona,2015,6.8
3,Virginia,2010,7.9
4,Virginia,2015,8.3


In [None]:
popdf.loc[1]

state    Arizona
year        2010
pop          6.6
Name: 1, dtype: object

In [None]:
popdf.loc[1,'year']

2010

**Using iloc**

In [None]:
popdf.iloc[1,0]   # use index for rows and columns

'Arizona'

In [None]:
popdf.iloc[1:4]   # slice rows

Unnamed: 0,state,year,pop
1,Arizona,2010,6.6
2,Arizona,2015,6.8
3,Virginia,2010,7.9


**Note that slicing with iloc works just like slicing ndarrays: the upper bound is excluded**
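
The difference between the two slicing rules is easiest to see on a frame with the default integer index, where the same slice `1:3` is valid for both accessors. A small sketch on a hypothetical frame `demo`:

```python
from pandas import DataFrame

# Hypothetical demo frame with the default index 0..4
demo = DataFrame({'pop': [5.9, 6.6, 6.8, 7.9, 8.3]})

loc_rows = demo.loc[1:3]     # label slice: labels 1, 2 and 3 (end included)
iloc_rows = demo.iloc[1:3]   # position slice: positions 1 and 2 (end excluded)

print(len(loc_rows), len(iloc_rows))
```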

In [None]:
popdf.iloc[[1,2,3]]  # list of row indexes

Unnamed: 0,state,year,pop
1,Arizona,2010,6.6
2,Arizona,2015,6.8
3,Virginia,2010,7.9


In [None]:
popdf.iloc[:,'state']

ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types

**<font color="red">With iloc you can only use integer indexes for rows and columns</font>**

In [None]:
popdf.iloc[:,0]

0     Arizona
1     Arizona
2     Arizona
3    Virginia
4    Virginia
Name: state, dtype: object

*same as*

In [None]:
popdf['state']

0     Arizona
1     Arizona
2     Arizona
3    Virginia
4    Virginia
Name: state, dtype: object

In [None]:
popdf2

Unnamed: 0,Arizona,Virginia,NJ
2005,5.9,,8.2
2010,6.6,7.9,8.4
2015,6.8,8.3,8.6
2020,7.2,8.6,8.9


In [None]:
popdf2.iloc[2:,[0,2]]

Unnamed: 0,Arizona,NJ
2015,6.8,8.6
2020,7.2,8.9


---

#### <font color="brown">Creating a DataFrame from a CSV file (typical usage)</font>

**Using the Pandas method read_csv**

In [None]:
import pandas as pd

In [None]:
# read_csv accepts a file path directly and closes the file when done
mpgs = pd.read_csv("auto_mpg_original.csv")
mpgs

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name
0,18.0,8.0,307.0,130.0,3504.0,12.0,70.0,1.0,chevrolet chevelle malibu
1,15.0,8.0,350.0,165.0,3693.0,11.5,70.0,1.0,buick skylark 320
2,18.0,8.0,318.0,150.0,3436.0,11.0,70.0,1.0,plymouth satellite
3,16.0,8.0,304.0,150.0,3433.0,12.0,70.0,1.0,amc rebel sst
4,17.0,8.0,302.0,140.0,3449.0,10.5,70.0,1.0,ford torino
...,...,...,...,...,...,...,...,...,...
401,27.0,4.0,140.0,86.0,2790.0,15.6,82.0,1.0,ford mustang gl
402,44.0,4.0,97.0,52.0,2130.0,24.6,82.0,2.0,vw pickup
403,32.0,4.0,135.0,84.0,2295.0,11.6,82.0,1.0,dodge rampage
404,28.0,4.0,120.0,79.0,2625.0,18.6,82.0,1.0,ford ranger


In [None]:
mpgs.shape

(406, 9)

**Note: NA entries in the file are read in as NaN, the marker pandas uses for a missing/null value.**
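
The NA handling can be tried without the data file by reading CSV text from memory. The sketch below uses a made-up two-row CSV (the text in `csv_text` is hypothetical, not taken from `auto_mpg_original.csv`) and relies on `read_csv` treating the string `NA` as missing by default:

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for the real file
csv_text = ("mpg,horsepower,car name\n"
            "18.0,130,chevrolet chevelle malibu\n"
            "NA,115,citroen ds-21 pallas\n")

demo = pd.read_csv(io.StringIO(csv_text))

print(demo)
# True marks the entries that were read in as NaN
print(demo['mpg'].isnull().tolist())
```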

In [None]:
# first 15 rows
mpgs.head(15)

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name
0,18.0,8.0,307.0,130.0,3504.0,12.0,70.0,1.0,chevrolet chevelle malibu
1,15.0,8.0,350.0,165.0,3693.0,11.5,70.0,1.0,buick skylark 320
2,18.0,8.0,318.0,150.0,3436.0,11.0,70.0,1.0,plymouth satellite
3,16.0,8.0,304.0,150.0,3433.0,12.0,70.0,1.0,amc rebel sst
4,17.0,8.0,302.0,140.0,3449.0,10.5,70.0,1.0,ford torino
5,15.0,8.0,429.0,198.0,4341.0,10.0,70.0,1.0,ford galaxie 500
6,14.0,8.0,454.0,220.0,4354.0,9.0,70.0,1.0,chevrolet impala
7,14.0,8.0,440.0,215.0,4312.0,8.5,70.0,1.0,plymouth fury iii
8,14.0,8.0,455.0,225.0,4425.0,10.0,70.0,1.0,pontiac catalina
9,15.0,8.0,390.0,190.0,3850.0,8.5,70.0,1.0,amc ambassador dpl


**Metadata - descriptive information - of dataframe**

In [None]:
mpgs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 406 entries, 0 to 405
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     406 non-null    float64
 2   displacement  406 non-null    float64
 3   horsepower    400 non-null    float64
 4   weight        406 non-null    float64
 5   acceleration  406 non-null    float64
 6   model year    406 non-null    float64
 7   origin        406 non-null    float64
 8   car name      406 non-null    object 
dtypes: float64(8), object(1)
memory usage: 28.7+ KB


**In the info above, note that each numeric column has an inferred datatype (float64); only car name is stored as object.<br>
Also note the number of non-null values per column. For instance, mpg has 8 missing values (406 - 398), and horsepower has 6 missing values (406 - 400).**
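
Rather than subtracting the non-null counts by hand, the missing values per column can be counted directly. A minimal sketch on a hypothetical frame `demo` (not the mpg data), assuming the usual idiom of summing the boolean result of `isnull()`:

```python
import numpy as np
from pandas import DataFrame

# Hypothetical demo frame with a few missing values
demo = DataFrame({'mpg': [18.0, np.nan, 15.0, np.nan],
                  'horsepower': [130.0, 115.0, np.nan, 140.0]})

# isnull() gives a boolean frame; summing counts the True values in each column
missing = demo.isnull().sum()
print(missing)
```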

##### <font color="brown">Get all rows for which mpg column has a null value</font>

In [None]:
mpgs[mpgs['mpg'].isnull()]

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name
10,,4.0,133.0,115.0,3090.0,17.5,70.0,2.0,citroen ds-21 pallas
11,,8.0,350.0,165.0,4142.0,11.5,70.0,1.0,chevrolet chevelle concours (sw)
12,,8.0,351.0,153.0,4034.0,11.0,70.0,1.0,ford torino (sw)
13,,8.0,383.0,175.0,4166.0,10.5,70.0,1.0,plymouth satellite (sw)
14,,8.0,360.0,175.0,3850.0,11.0,70.0,1.0,amc rebel sst (sw)
17,,8.0,302.0,140.0,3353.0,8.0,70.0,1.0,ford mustang boss 302
39,,4.0,97.0,48.0,1978.0,20.0,71.0,2.0,volkswagen super beetle 117
367,,4.0,121.0,110.0,2800.0,15.4,81.0,2.0,saab 900s


##### <font color="brown">Get all rows for which horsepower column has a null value</font>

In [None]:
mpgs[mpgs['horsepower'].isnull()]

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name
38,25.0,4.0,98.0,,2046.0,19.0,71.0,1.0,ford pinto
133,21.0,6.0,200.0,,2875.0,17.0,74.0,1.0,ford maverick
337,40.9,4.0,85.0,,1835.0,17.3,80.0,2.0,renault lecar deluxe
343,23.6,4.0,140.0,,2905.0,14.3,80.0,1.0,ford mustang cobra
361,34.5,4.0,100.0,,2320.0,15.8,81.0,2.0,renault 18i
382,23.0,4.0,151.0,,3035.0,20.5,82.0,1.0,amc concord dl


##### <font color="brown">Get summary stats for numeric columns</font>
**describe** method

In [None]:
mpgs.describe()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin
count,398.0,406.0,406.0,400.0,406.0,406.0,406.0,406.0
mean,23.514573,5.475369,194.779557,105.0825,2979.413793,15.519704,75.921182,1.568966
std,7.815984,1.71216,104.922458,38.768779,847.004328,2.803359,3.748737,0.797479
min,9.0,3.0,68.0,46.0,1613.0,8.0,70.0,1.0
25%,17.5,4.0,105.0,75.75,2226.5,13.7,73.0,1.0
50%,23.0,4.0,151.0,95.0,2822.5,15.5,76.0,1.0
75%,29.0,8.0,302.0,130.0,3618.25,17.175,79.0,2.0
max,46.6,8.0,455.0,230.0,5140.0,24.8,82.0,3.0


In [None]:
# median values
mpgs.describe().loc['50%']

mpg               23.0
cylinders          4.0
displacement     151.0
horsepower        95.0
weight          2822.5
acceleration      15.5
model year        76.0
origin             1.0
Name: 50%, dtype: float64

In [None]:
mpgs['mpg'].median()

23.0

---

#### <font color="brown">Numpy ufuncs work with DataFrames</font>

In [None]:
df = DataFrame(np.random.randn(4,3),columns=list("ABC"),index=["One","Two","Three",'Four'])
df

Unnamed: 0,A,B,C
One,0.299397,-0.40184,-0.078553
Two,0.290432,-0.617779,1.527505
Three,-1.398984,-0.594014,-1.321731
Four,0.108096,-0.208644,-0.202883


In [None]:
np.abs(df)  

Unnamed: 0,A,B,C
One,0.299397,0.40184,0.078553
Two,0.290432,0.617779,1.527505
Three,1.398984,0.594014,1.321731
Four,0.108096,0.208644,0.202883


In [None]:
df  # is original changed?

Unnamed: 0,A,B,C
One,0.299397,-0.40184,-0.078553
Two,0.290432,-0.617779,1.527505
Three,-1.398984,-0.594014,-1.321731
Four,0.108096,-0.208644,-0.202883


**Original is not changed**

**Alternatively, the dataframe method abs() can be used; it does not change the original df either**

In [None]:
df.abs()

Unnamed: 0,A,B,C
One,0.299397,0.40184,0.078553
Two,0.290432,0.617779,1.527505
Three,1.398984,0.594014,1.321731
Four,0.108096,0.208644,0.202883


In [None]:
df

Unnamed: 0,A,B,C
One,0.299397,-0.40184,-0.078553
Two,0.290432,-0.617779,1.527505
Three,-1.398984,-0.594014,-1.321731
Four,0.108096,-0.208644,-0.202883


**Assign to get modified result**

In [None]:
dfabs = df.abs()
dfabs

Unnamed: 0,A,B,C
One,0.299397,0.40184,0.078553
Two,0.290432,0.617779,1.527505
Three,1.398984,0.594014,1.321731
Four,0.108096,0.208644,0.202883


In [None]:
dfabs.mean()  

A    0.524227
B    0.455570
C    0.782668
dtype: float64

**<font color="red">Note: default axis is 0, so above gets column means</font>**

In [None]:
dfabs.mean(axis=1)  # row means

One      0.259930
Two      0.811905
Three    1.104910
Four     0.173208
dtype: float64

In [None]:
dfabs.cumsum(axis=1)   # cumulative sums of rows

Unnamed: 0,A,B,C
One,0.299397,0.701237,0.779791
Two,0.290432,0.908211,2.435716
Three,1.398984,1.992998,3.314729
Four,0.108096,0.316741,0.519623


In [None]:
dfabs.sum()     # sum of each column

A    2.096910
B    1.822278
C    3.130672
dtype: float64

##### **What if there are NaN values?**

In [None]:
dfabs2 = dfabs.copy()
dfabs2

Unnamed: 0,A,B,C
One,0.299397,0.40184,0.078553
Two,0.290432,0.617779,1.527505
Three,1.398984,0.594014,1.321731
Four,0.108096,0.208644,0.202883


In [None]:
dfabs2.iloc[1,1] = np.nan
dfabs2

Unnamed: 0,A,B,C
One,0.299397,0.40184,0.078553
Two,0.290432,,1.527505
Three,1.398984,0.594014,1.321731
Four,0.108096,0.208644,0.202883


In [None]:
dfabs2['B'].sum()   

1.2044992585391945

**NaNs are skipped when summing**<br>
**But they can be considered if needed, with skipna parameter set to False**

In [None]:
dfabs2.mean(skipna=False)  

A    0.524227
B         NaN
C    0.782668
dtype: float64

**If any of the values is NaN, the result is NaN when skipna=False**

In [None]:
dfabs

Unnamed: 0,A,B,C
One,0.299397,0.40184,0.078553
Two,0.290432,0.617779,1.527505
Three,1.398984,0.594014,1.321731
Four,0.108096,0.208644,0.202883


In [None]:
dfabs['C'].argmax()

1

In [None]:
dfabs.loc['Three'].argmax()

0
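
The two results above are integer positions, not index labels: `argmax` reports where the maximum sits, counting from 0. When the label is wanted instead, `idxmax` returns it. A small sketch on a hypothetical Series `s` with rounded values:

```python
from pandas import Series

# Hypothetical Series with string labels
s = Series([0.08, 1.53, 1.32, 0.20], index=['One', 'Two', 'Three', 'Four'])

print(s.argmax())   # integer position of the maximum
print(s.idxmax())   # index label of the maximum
```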