### <span style="color:black"><b>Pandas Tutorial 6</b></span>

---

<u>Changing data types</u>

Sometimes a column is loaded with the wrong dtype, which causes problems when we try to aggregate it or filter rows on it. This video serves as an introduction to how we can fix that in pandas.

Useful series methods:
<pre>
df['series_name'].astype(...)
</pre>

* We can use the `astype()` method when we wish to change the dtype of a particular Series
* [Documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html) for `astype()`
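As a minimal sketch of `astype()` in isolation, the snippet below converts a Series of numeric strings to integer and float dtypes. The Series name `price` and its values are made up for illustration and are not part of `people.csv`.

```python
import pandas as pd

# Hypothetical data: whole numbers that were read in as strings
prices = pd.Series(["1", "2", "3"], name="price")

# astype() returns a new Series; the original is left unchanged
as_int = prices.astype(int)
as_float = prices.astype("float64")

print(as_int.dtype)
print(as_float.dtype)
```

Because `astype()` returns a new object, the result has to be assigned back (for example `df['series_name'] = df['series_name'].astype(int)`) for the change to stick.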

Useful top level functions:
<pre>
pd.to_numeric(df['series_name'])
pd.to_datetime(df['series_name'])
</pre>
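A small, self-contained sketch of both top level functions follows. The DataFrame `raw` and its column names are assumptions chosen to resemble the data used later in this notebook.

```python
import pandas as pd

# Hypothetical frame where every column arrived as text
raw = pd.DataFrame({
    "height": ["150.8", "179.4"],
    "born": ["2010-05-31", "2008-06-30"],
})

# to_numeric picks a suitable numeric dtype (float64 here, since the values have decimals)
raw["height"] = pd.to_numeric(raw["height"])

# to_datetime parses ISO-formatted date strings into a datetime64 column
raw["born"] = pd.to_datetime(raw["born"])

print(raw.dtypes)
```

Once `born` is a datetime column, accessors such as `raw["born"].dt.year` become available.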

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('people.csv')
df

Unnamed: 0,P_ID,Born,Height,Weight,Test
0,--,--,--,--,--
1,1.0,2010-05-31,150.8,70.0,4
2,2.0,2008-06-30,179.4,78.0,3
3,3.0,2008-08-01,160.7,70.0,weird value
4,4.0,2013-08-03,110.2,45.0,6
5,5.0,2013-10-01,100.1,45.0,2
6,6.0,2021-08-03,40.0,,another weird value
7,7.0,2003-12-01,200.3,90.0,0


In [3]:
# Let's get rid of that top row
new_df = df.iloc[1:, :].copy()
new_df

Unnamed: 0,P_ID,Born,Height,Weight,Test
1,1.0,2010-05-31,150.8,70.0,4
2,2.0,2008-06-30,179.4,78.0,3
3,3.0,2008-08-01,160.7,70.0,weird value
4,4.0,2013-08-03,110.2,45.0,6
5,5.0,2013-10-01,100.1,45.0,2
6,6.0,2021-08-03,40.0,,another weird value
7,7.0,2003-12-01,200.3,90.0,0


In [4]:
new_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 1 to 7
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   P_ID    7 non-null      object
 1   Born    7 non-null      object
 2   Height  7 non-null      object
 3   Weight  6 non-null      object
 4   Test    7 non-null      object
dtypes: object(5)
memory usage: 412.0+ bytes


In [5]:
# Get Height > 150
# Won't run (raises a TypeError) because Height is still stored as strings, not numbers

# new_df.loc[new_df['Height'] > 150, :]
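
To make the failure concrete, here is a rough sketch using a stand-in Series rather than `new_df`. The values are invented, and the object dtype is set explicitly to mimic a column that was read in as text.

```python
import pandas as pd

# Hypothetical stand-in for the Height column: numbers stored as Python strings
heights = pd.Series(["150.8", "179.4", "40.0"], dtype=object)

# Comparing strings with an integer is not defined, so pandas raises TypeError
try:
    heights > 150
    failed = False
except TypeError as exc:
    failed = True
    print(f"Comparison failed: {exc}")

# After converting to a numeric dtype the same filter works
numeric_heights = pd.to_numeric(heights)
tall = numeric_heights[numeric_heights > 150]
print(tall)
```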

In [6]:
# Change to numeric
new_df['P_ID'] = pd.to_numeric(new_df['P_ID'])
new_df['Height'] = pd.to_numeric(new_df['Height'])
new_df['Weight'] = pd.to_numeric(new_df['Weight'])

# Change to date
new_df['Born'] = pd.to_datetime(new_df['Born'])

new_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 1 to 7
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   P_ID    7 non-null      float64       
 1   Born    7 non-null      datetime64[ns]
 2   Height  7 non-null      float64       
 3   Weight  6 non-null      float64       
 4   Test    7 non-null      object        
dtypes: datetime64[ns](1), float64(3), object(1)
memory usage: 412.0+ bytes


In [7]:
# Now change to int
new_df['P_ID'] = new_df['P_ID'].astype(int)

# The nullable 'Int64' dtype (capital I) allows us to change to int in the presence of nulls
new_df['Weight'] = new_df['Weight'].astype('Int64')

new_df

Unnamed: 0,P_ID,Born,Height,Weight,Test
1,1,2010-05-31,150.8,70.0,4
2,2,2008-06-30,179.4,78.0,3
3,3,2008-08-01,160.7,70.0,weird value
4,4,2013-08-03,110.2,45.0,6
5,5,2013-10-01,100.1,45.0,2
6,6,2021-08-03,40.0,,another weird value
7,7,2003-12-01,200.3,90.0,0
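
The difference between `int` and `'Int64'` can be sketched on a small Series with a missing value. The Series below is a hypothetical stand-in for the `Weight` column, not the notebook's actual data.

```python
import pandas as pd

# Hypothetical weights with one missing value (stored as NaN in a float column)
weights = pd.Series([70.0, 78.0, None])

# The plain NumPy int dtype has no way to represent a missing value,
# so the conversion raises an error (a subclass of ValueError)
try:
    weights.astype(int)
    plain_int_failed = False
except ValueError as exc:
    plain_int_failed = True
    print(f"astype(int) failed: {exc}")

# The nullable extension dtype 'Int64' keeps the gap as <NA>
nullable = weights.astype("Int64")
print(nullable)
```

Note the capital `I`: `'int64'` is the NumPy dtype and fails in the same way as `int`, while `'Int64'` is the pandas nullable integer dtype.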


In [8]:
# Values that can't be converted can be turned into null values (NaN), which I will explore in a later notebook
new_df['Test'] = pd.to_numeric(new_df['Test'], errors='coerce')
new_df

Unnamed: 0,P_ID,Born,Height,Weight,Test
1,1,2010-05-31,150.8,70.0,4.0
2,2,2008-06-30,179.4,78.0,3.0
3,3,2008-08-01,160.7,70.0,
4,4,2013-08-03,110.2,45.0,6.0
5,5,2013-10-01,100.1,45.0,2.0
6,6,2021-08-03,40.0,,
7,7,2003-12-01,200.3,90.0,0.0
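
As a hedged sketch of what `errors='coerce'` changes, the snippet below contrasts it with the default behaviour (`errors='raise'`) and shows one possible way to list the values that were lost. The Series `scores` is illustrative and only loosely modelled on the `Test` column.

```python
import pandas as pd

# Hypothetical column mixing numeric strings with a bad entry
scores = pd.Series(["4", "weird value", "0"])

# By default, to_numeric raises as soon as it meets a value it cannot parse
try:
    pd.to_numeric(scores)
    raised = False
except ValueError as exc:
    raised = True
    print(f"Default behaviour failed: {exc}")

# With errors='coerce', unparseable values become NaN instead
coerced = pd.to_numeric(scores, errors="coerce")
print(coerced)

# Entries that were present before but are null afterwards are the ones that were coerced
lost = scores[coerced.isna() & scores.notna()]
print(lost.tolist())
```

Checking which entries were coerced before overwriting the column is a cheap way to confirm that only genuinely bad values are being discarded.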
