<h1>Working with different data types</h1> <br>
<p>This section will cover the following topics:
    <ul>
        <li>Finding data type information about the dataset</li>
        <li>Converting from one data type to another</li>
        <li>Selecting columns based on data types</li>
        <li>Converting date time data</li>
        <li>Additional topics</li>
    </ul>
</p>

<strong>Install pandas:</strong>

In [None]:
!pip install pandas

<strong>Import pandas to notebook:</strong>

In [None]:
import pandas as pd

The dataset covers significant earthquakes with a magnitude of 5.5 or higher and records their date, time, and location. Only 1000 rows from the middle of the dataset are used in this demo for the sake of simplicity.

In [None]:
#Read dataset
df1 = pd.read_csv('../csv-files/database.csv')

#Slice the dataset to arbitrarily select only 1000 rows
df1 = df1[1000:2000]

#Save the sliced dataset to a local csv file
df1.to_csv('significant_earthquakes.csv')

<strong>Read sliced dataset:</strong>

In [None]:
df1 = pd.read_csv('significant_earthquakes.csv')

<strong>Look at the first 5 rows:</strong>

In [None]:
df1.head()

The first column, Unnamed: 0, is the old row index that to_csv() wrote to the file. It is not useful, so it can be dropped.

In [None]:
df1.drop(['Unnamed: 0'], axis = 1, inplace = True)

In [None]:
#Check if column has been dropped
df1.head()
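An alternative to dropping the column afterwards is to keep the index out of the file in the first place. The sketch below uses a small made-up DataFrame (the name sample and its values are assumptions, not the earthquake data) and round-trips it through CSV text in memory instead of a file on disk.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for the earthquake data
sample = pd.DataFrame({
    "Latitude": [19.246, 1.863, -20.579],
    "Magnitude": [6.0, 5.8, 6.2],
})

# Default behaviour: the index is written as an unnamed first column
reloaded_with_index = pd.read_csv(io.StringIO(sample.to_csv()))

# index=False leaves the index out of the CSV
reloaded_clean = pd.read_csv(io.StringIO(sample.to_csv(index=False)))

print(list(reloaded_with_index.columns))  # ['Unnamed: 0', 'Latitude', 'Magnitude']
print(list(reloaded_clean.columns))       # ['Latitude', 'Magnitude']
```

Passing index = False when the sliced dataset is saved would make the later drop step unnecessary.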

<h1>Finding data type information about the dataset</h1>

<strong>Use .dtypes to see the data types of all the columns</strong>

In [None]:
df1.dtypes

<strong>.dtypes can also be applied to an individual column:</strong>

In [None]:
df1['Latitude'].dtypes

<strong>.info() can also be used to see all the data types under the header Dtype:</strong>

In [None]:
df1.info()
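To make the difference between the two attributes concrete, here is a minimal sketch on a made-up DataFrame (quakes and its values are assumptions for illustration). On a DataFrame, .dtypes returns a Series that maps each column name to its data type; on a single column, .dtype returns one data type.

```python
import pandas as pd

# Hypothetical stand-in for a few numeric columns of the dataset
quakes = pd.DataFrame({
    "Latitude": [19.246, 1.863, -20.579],
    "Depth": [131.6, 80.0, 20.0],
    "Stations": [12, 30, 7],
})

column_types = quakes.dtypes               # Series: column name -> dtype
latitude_type = quakes["Latitude"].dtype   # a single dtype object

print(column_types)
print(latitude_type)  # float64

# Count how many columns share each data type
print(column_types.value_counts())
```

Counting the data types this way gives a quick overview when a dataset has many columns.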

<h1>Converting from one data type to another</h1>

<strong>Use .astype() to convert from one data type to another:</strong>

In [None]:
df2 = df1.copy() #copy the dataframe
df2['Latitude'] = df1['Latitude'].astype('int64') #convert the data type to int64
df2.dtypes
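Note that converting floats to int64 discards the fractional part rather than rounding. The sketch below, on made-up values, shows that behaviour. It also shows pd.to_numeric() with errors = 'coerce' as one possible way to convert text that may contain non-numeric entries, a case where .astype() would raise an error instead.

```python
import pandas as pd

latitudes = pd.Series([19.246, 1.863, -20.579, 59.9])

# astype('int64') drops the fractional part (truncates toward zero)
truncated = latitudes.astype("int64")

# Rounding first keeps the nearest whole number instead
rounded = latitudes.round().astype("int64")

print(truncated.tolist())  # [19, 1, -20, 59]
print(rounded.tolist())    # [19, 2, -21, 60]

# Hypothetical text column with one entry that is not a number
raw_depths = pd.Series(["131.6", "80", "unknown"])

# errors='coerce' turns unparseable entries into NaN instead of raising
depths = pd.to_numeric(raw_depths, errors="coerce")
print(depths.dtype)           # float64
print(depths.isna().tolist()) # [False, False, True]
```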

<h1>Converting date time data</h1>

Look at the data types again. Notice that the Date and Time columns both show <em>object</em> as their data type.

In [None]:
df1.dtypes

<strong>Use pd.to_datetime() to convert a column to the datetime64 data type:</strong>

In [None]:
df1['Date'] = pd.to_datetime(df1['Date'], format = '%m/%d/%Y')

In [None]:
df1.head(10)

In [None]:
df1.dtypes

Find the difference between two dates in the dataset and look at the result.

In [None]:
df1['Date'][6] - df1['Date'][0]

The difference between the two dates is returned as a Timedelta. It shows the number of days, followed by the remainder in hours:minutes:seconds.

<em>Now with Time:</em>

In [None]:
df1['Time'] = pd.to_datetime(df1['Time'], format = '%H:%M:%S')

In [None]:
df1['Time'][6] - df1['Time'][0]
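When only a time is parsed, pandas fills in 1900-01-01 as the date, so subtracting two Time values ignores the day each event happened on. One possible approach, sketched below on two made-up rows (the events DataFrame is an assumption), is to join the Date and Time strings and parse them as a single datetime.

```python
import pandas as pd

# Hypothetical rows in the same text format as the Date and Time columns
events = pd.DataFrame({
    "Date": ["01/02/1965", "01/08/1965"],
    "Time": ["13:44:18", "11:29:49"],
})

dates = pd.to_datetime(events["Date"], format="%m/%d/%Y")
date_gap = dates[1] - dates[0]
print(date_gap)                  # 6 days 00:00:00
print(date_gap.days)             # 6
print(date_gap.total_seconds())  # 518400.0

# Parsing a time on its own attaches the placeholder date 1900-01-01
times = pd.to_datetime(events["Time"], format="%H:%M:%S")
print(times[0])                  # 1900-01-01 13:44:18

# Joining both strings gives one complete timestamp per row
combined = pd.to_datetime(
    events["Date"] + " " + events["Time"], format="%m/%d/%Y %H:%M:%S"
)
full_gap = combined[1] - combined[0]
print(full_gap)                  # 5 days 21:45:31
```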

<h2>More ways to find other information from the datetime64 data type</h2>

In [None]:
df1['Date'][1]

In [None]:
#Find the day
df1['Date'][1].day

In [None]:
#Find the month
df1['Date'][1].month

In [None]:
#Find the year
df1['Date'][1].year
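The attributes above work on one value at a time. For a whole column, the .dt accessor exposes the same information for every row at once. A minimal sketch on made-up dates:

```python
import pandas as pd

# Hypothetical dates in the same text format as the Date column
dates = pd.to_datetime(
    pd.Series(["01/02/1965", "02/15/1966", "12/31/1967"]), format="%m/%d/%Y"
)

years = dates.dt.year
months = dates.dt.month
days = dates.dt.day
weekdays = dates.dt.day_name()

print(years.tolist())   # [1965, 1966, 1967]
print(months.tolist())  # [1, 2, 12]
print(days.tolist())    # [2, 15, 31]
print(weekdays.tolist())
```

The results are ordinary Series, so they can be assigned back to the DataFrame as new columns.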

<h1>Selecting columns based on data types</h1>

<strong>Use .select_dtypes() to select columns based on their data types:</strong>

In [None]:
#First, revisit the data types for df1
df1.dtypes

In [None]:
#Now create a new DataFrame named decimals which contains the columns from df1 with the data type float64
decimals = df1.select_dtypes('float')

#Show the first 5 rows
decimals.head()

In [None]:
decimals.dtypes

<strong>Add <em>exclude</em> parameter to exclude certain data types:</strong>

In [None]:
#Create a DataFrame which does not contain any column with the object data type
number_data = df1.select_dtypes(exclude = 'object')

In [None]:
number_data.head()

In [None]:
number_data.dtypes
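Excluding object keeps every other data type, including datetime64 columns. If only numeric columns are wanted, include = 'number' is a stricter option. The sketch below uses a made-up DataFrame (quakes and the Flagged column are assumptions) to compare a few selections.

```python
import pandas as pd

# Hypothetical DataFrame with datetime, float, integer and boolean columns
quakes = pd.DataFrame({
    "Date": pd.to_datetime(["1965-01-02", "1965-01-04"]),
    "Latitude": [19.246, 1.863],
    "Stations": [12, 30],
    "Flagged": [True, False],
})

# 'number' matches integer and float columns, but not bool or datetime
numeric = quakes.select_dtypes(include="number")

# 'datetime' matches datetime64 columns
datetimes = quakes.select_dtypes(include="datetime")

# A list selects several data types at once
mixed = quakes.select_dtypes(include=["float64", "bool"])

print(list(numeric.columns))    # ['Latitude', 'Stations']
print(list(datetimes.columns))  # ['Date']
print(list(mixed.columns))      # ['Latitude', 'Flagged']
```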

<h1>Additional ways of working with data types</h1>

<strong>Changing data types while importing data:</strong>

In [None]:
#Create a dictionary which contains the column and its to-be-modified data type as key-value pair
dtypes_dict = {'Depth': 'object'}

In [None]:
#Change column Depth data type to object at the time of reading the data
df3 = pd.read_csv('significant_earthquakes.csv', dtype = dtypes_dict)

In [None]:
df3.dtypes
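The same dictionary can set a different data type for each column in one read. The sketch below reads a few made-up rows from CSV text in memory (csv_text is an assumption standing in for the file) and compares the default data types with the requested ones.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for the earthquake file
csv_text = (
    "Depth,Magnitude,Type\n"
    "131.6,6.0,Earthquake\n"
    "80.0,5.8,Earthquake\n"
    "20.0,6.2,Explosion\n"
)

# Without dtype, pandas infers float64 for the numeric columns
default_read = pd.read_csv(io.StringIO(csv_text))

# With dtype, each listed column is read with the requested data type
typed_read = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"Depth": "object", "Magnitude": "float32", "Type": "category"},
)

print(default_read["Depth"].dtype)   # float64
print(typed_read.dtypes)
print(typed_read["Depth"].tolist())  # ['131.6', '80.0', '20.0']
```

Reading Depth as object keeps the values as the original text, which is why they print with quotes.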

Data types also influence the memory usage of the dataset. To see the memory being consumed by the dataset, use .info().

In [None]:
df3.info()
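For object columns, the default figure counts only the references to the Python objects and not the strings themselves, which is why .info() may show a + after the number. A minimal sketch on a made-up DataFrame (quakes is an assumption) shows how deep = True and memory_usage = 'deep' report the fuller figure.

```python
import pandas as pd

# Hypothetical DataFrame with one float column and one object column
quakes = pd.DataFrame({
    "Depth": [131.6, 80.0, 20.0],
    "Type": pd.Series(["Earthquake", "Earthquake", "Explosion"], dtype="object"),
})

# Default: object columns are measured by their references only
shallow = quakes.memory_usage()

# deep=True also measures the Python strings the references point to
deep = quakes.memory_usage(deep=True)

print(shallow)
print(deep)

# info() accepts the same option
quakes.info(memory_usage="deep")
```

The float column reports the same size either way; only the object column grows under the deep measurement.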

<strong><em>Category</em> data type to save memory:</strong>

The <em>category</em> data type can be used for columns containing categorical data, that is, columns with a small number of distinct values that repeat often. Such columns usually have the ‘object’ data type by default. Changing their data type to ‘category’, as shown below, saves memory.

In [None]:
dtypes1 = {
    'Type': 'category',
    'Status': 'category',
    'Source': 'category',
    'Location Source': 'category',
    'Magnitude Source': 'category',
    'Magnitude Type': 'category'
}

df4 = pd.read_csv('significant_earthquakes.csv', dtype = dtypes1, parse_dates = ['Date', 'Time'])

In [None]:
df4.info()

The memory usage drops from 172 KB to 131.5 KB. It is therefore worth choosing the right data type for each column.
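The saving comes from storing each distinct label once and keeping a small integer code per row. The sketch below measures this on a made-up column with three repeated labels (the labels and row count are assumptions). The actual saving depends on how many distinct values a column has relative to its length.

```python
import pandas as pd

# Hypothetical column: three distinct labels repeated 1000 times each
sources = pd.Series(["ISCGEM", "US", "CI"] * 1000, dtype="object")
as_category = sources.astype("category")

# deep=True includes the size of the strings themselves
object_bytes = sources.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(object_bytes, category_bytes)

# The distinct labels are stored once, in sorted order
print(list(as_category.cat.categories))  # ['CI', 'ISCGEM', 'US']
```

A column where almost every value is unique, such as an ID, would gain little or nothing from this conversion.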