<a href="https://colab.research.google.com/github/pbeens/OTF-Data-Analysis-2021-05/blob/main/Fixing_Data_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**If you open this document in Colab, use the Table of Contents on the left to go directly to the section you are interested in.**

# Converting to Dates

This example shows a date field (FILE_DATE) that needs to be converted from "object" to "datetime" so the data can be plotted correctly.

The data set is Ontario COVID-19 case counts, organized by public health unit (PHU).

In [1]:
# Read in and prep the test data
# Observation: FILE_DATE is read in as "object" (text), not as a datetime type

import pandas as pd

url = 'https://data.ontario.ca/dataset/1115d5fe-dd84-4c69-b5ed-05bf0c0a0ff9/resource/d1bfe1ad-6575-4352-8302-09ca81f7ddfc/download/cases_by_status_and_phu.csv'
df = pd.read_csv(url)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13702 entries, 0 to 13701
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   FILE_DATE       13702 non-null  object 
 1   PHU_NAME        13701 non-null  object 
 2   PHU_NUM         13701 non-null  float64
 3   ACTIVE_CASES    13702 non-null  int64  
 4   RESOLVED_CASES  13702 non-null  int64  
 5   DEATHS          13702 non-null  int64  
dtypes: float64(1), int64(3), object(2)
memory usage: 642.4+ KB


In [2]:
# The function we need is pd.to_datetime()
# Observation: FILE_DATE has been converted to datetime64[ns]

df['FILE_DATE'] = pd.to_datetime(df['FILE_DATE'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13702 entries, 0 to 13701
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   FILE_DATE       13702 non-null  datetime64[ns]
 1   PHU_NAME        13701 non-null  object        
 2   PHU_NUM         13701 non-null  float64       
 3   ACTIVE_CASES    13702 non-null  int64         
 4   RESOLVED_CASES  13702 non-null  int64         
 5   DEATHS          13702 non-null  int64         
dtypes: datetime64[ns](1), float64(1), int64(3), object(1)
memory usage: 642.4+ KB
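
To see why the conversion matters for plotting, here is a minimal sketch using a small made-up Series rather than the Ontario file. The sample values and the `%m/%d/%Y` format are assumptions for illustration only; the real FILE_DATE format may differ. It shows that text dates sort character by character, and that `errors="coerce"` can be used to turn unparseable entries into `NaT` instead of raising an error.

```python
import pandas as pd

# Hypothetical sample values (not taken from the Ontario data)
raw = pd.Series(["9/1/2020", "10/1/2020", "not a date"])

# As text, the values sort character by character,
# so "10/1/2020" lands before "9/1/2020"
print(raw.sort_values().tolist())

# Parse with an explicit format; errors="coerce" turns bad entries into NaT
parsed = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")

# As datetimes, the values sort chronologically (NaT goes last)
print(parsed.sort_values().tolist())

# Count the entries that could not be parsed
print(parsed.isna().sum())
```

Counting the `NaT` values after a coerced conversion is a quick way to check how many rows failed to parse before plotting.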


# Deleting Rows

This data has an extra row at the bottom (a footnote, not data) that needs to be removed.

In [3]:
# Read in and prep the test data
# The data is a partial extract from https://dieselnet.com/standards/us/fe.php
# Observation: The last row is a footnote, not data, and needs to be deleted

import pandas as pd

url = 'https://raw.githubusercontent.com/pbeens/OTF-Data-Analysis-2021-05/main/datafiles/bad_cafe_data.csv'

df = pd.read_csv(url)
df

Unnamed: 0,Year,CAFE MPG,All Light Trucks CAFE MPG
0,2007,27.5,22.2
1,2008,27.5,22.5
2,2009,27.5,23.1
3,2010,27.5,23.5
4,2011,27.5,a
5,a Reformed CAFE standards,a Reformed CAFE standards,a Reformed CAFE standards


In [4]:
# Let's use slicing to keep only the rows we want (0-4).
# [:-1] keeps every row except the last one.
# For more info on slicing, see https://realpython.com/lessons/indexing-and-slicing/

df = df[:-1]
df

Unnamed: 0,Year,CAFE MPG,All Light Trucks CAFE MPG
0,2007,27.5,22.2
1,2008,27.5,22.5
2,2009,27.5,23.1
3,2010,27.5,23.5
4,2011,27.5,a
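
Slicing is one of several ways to drop a trailing row. The sketch below shows three roughly equivalent options on a small inline sample that mimics the table above; the sample text is an assumption for illustration and is not the real `bad_cafe_data.csv` file.

```python
import io
import pandas as pd

# Small inline sample mirroring the table above (illustrative, not the real file)
csv_text = """Year,CAFE MPG,All Light Trucks CAFE MPG
2010,27.5,23.5
2011,27.5,a
a Reformed CAFE standards,a Reformed CAFE standards,a Reformed CAFE standards
"""

df = pd.read_csv(io.StringIO(csv_text))

# Option 1: positional slice, same effect as df[:-1]
trimmed = df.iloc[:-1]

# Option 2: drop the last row by its index label
dropped = df.drop(df.index[-1])

# Option 3: skip the footer row while reading (needs the python engine)
skipped = pd.read_csv(io.StringIO(csv_text), skipfooter=1, engine="python")

print(len(df), len(trimmed), len(dropped), len(skipped))
```

Skipping the footer at read time has the side benefit that columns containing only numbers (here, Year) are read as numeric rather than as text.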
