# 1  Working with different data types

In this section, we will cover the following topics:
• Finding data type information about the dataset
• Converting from one data type to another
• Converting date time data
• Selecting columns based on data types
• Some additional topics
Let us get started.
If you do not already have pandas installed, you need to use pip to install it as shown below.

In [1]:
pip install pandas

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.



Next, we need to import pandas into our notebook.

In [2]:
import pandas as pd

Let us now look at the dataset for this task. It covers significant earthquakes with a magnitude of 5.5 or higher and records their date, time and location. The dataset can be downloaded by clicking [here](https://www.kaggle.com/datasets/usgs/earthquake-database). For the sake of simplicity, only 1000 rows from the middle of the dataset are used in this demo. You can use the code below to shorten your original dataset after downloading it from the above link.

df1 = pd.read_csv('Downloaded_dataset.csv')
# This is the downloaded dataset.

df1 = df1[1000:2000]
# Slice the dataset to arbitrarily select only 1000 rows.

df1.to_csv('significant_earthquakes.csv')  # call to_csv on your own DataFrame; here it is named df1
# Save the sliced dataset to a local CSV file named
# 'significant_earthquakes.csv' so that we do not need to slice every time we run this notebook.

Let us now read it into a pandas DataFrame named df1. If you want to know more about pandas DataFrame, click [here](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).

In [3]:
df1= pd.read_csv('database.csv')

In [4]:
df1= df1[1000:2000]

In [6]:
df1.to_csv('significant_earthquakes.csv')

In [7]:
df1= pd.read_csv('significant_earthquakes.csv')

Let us now look at the first 5 rows of the DataFrame df1 by using the .head() function.

In [8]:
df1.head()

Unnamed: 0.1,Unnamed: 0,Date,Time,Latitude,Longitude,Type,Depth,Depth Error,Depth Seismic Stations,Magnitude,...,Magnitude Seismic Stations,Azimuthal Gap,Horizontal Distance,Horizontal Error,Root Mean Square,ID,Source,Location Source,Magnitude Source,Status
0,1000,07/29/1968,11:12:02,-22.531,-174.86,Earthquake,45.0,,,6.3,...,,,,,,ISCGEM819475,ISCGEM,ISCGEM,ISCGEM,Automatic
1,1001,07/29/1968,23:52:18,-0.291,133.409,Earthquake,20.0,,,6.2,...,,,,,,ISCGEM819494,ISCGEM,ISCGEM,ISCGEM,Automatic
2,1002,07/30/1968,20:38:43,-6.987,-80.544,Earthquake,31.8,,,6.5,...,,,,,,ISCGEM819518,ISCGEM,ISCGEM,ISCGEM,Automatic
3,1003,08/01/1968,00:14:17,-26.822,-177.09,Earthquake,127.2,,,5.7,...,,,,,,ISCGEM817534,ISCGEM,ISCGEM,ISCGEM,Automatic
4,1004,08/01/1968,20:19:22,16.316,122.067,Earthquake,25.0,,,7.6,...,,,,,,ISCGEM817557,ISCGEM,ISCGEM,ISCGEM,Automatic


The first column ‘Unnamed: 0’ holds the old row index that to_csv wrote to the file. It is not useful, so it can be dropped.

In [9]:
df1.drop(['Unnamed: 0'], axis= 1, inplace= True)
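
The extra column can also be avoided instead of dropped. The following is a minimal sketch, assuming a tiny stand-in DataFrame and in-memory buffers rather than the real CSV file; the variable names (`df`, `clean`, `restored`) are illustrative only. It relies on the standard `index=False` option of `to_csv` and the `index_col` option of `read_csv`.

```python
import io
import pandas as pd

# Tiny stand-in for the earthquake data (values copied from the rows shown above).
df = pd.DataFrame({
    "Date": ["07/29/1968", "07/29/1968", "07/30/1968"],
    "Magnitude": [6.3, 6.2, 6.5],
})

# Option 1: do not write the row index when saving.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
clean = pd.read_csv(buffer)

# Option 2: the index was written, so read the first column back as the index.
buffer_with_index = io.StringIO()
df.to_csv(buffer_with_index)
buffer_with_index.seek(0)
restored = pd.read_csv(buffer_with_index, index_col=0)

print(list(clean.columns))
print(list(restored.columns))
```

Either option leaves only the original columns, so no drop step is needed afterwards.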

Let us use .head() to confirm if the column has been dropped.

In [11]:
df1.head()

Unnamed: 0,Date,Time,Latitude,Longitude,Type,Depth,Depth Error,Depth Seismic Stations,Magnitude,Magnitude Type,...,Magnitude Seismic Stations,Azimuthal Gap,Horizontal Distance,Horizontal Error,Root Mean Square,ID,Source,Location Source,Magnitude Source,Status
0,07/29/1968,11:12:02,-22.531,-174.86,Earthquake,45.0,,,6.3,MW,...,,,,,,ISCGEM819475,ISCGEM,ISCGEM,ISCGEM,Automatic
1,07/29/1968,23:52:18,-0.291,133.409,Earthquake,20.0,,,6.2,MW,...,,,,,,ISCGEM819494,ISCGEM,ISCGEM,ISCGEM,Automatic
2,07/30/1968,20:38:43,-6.987,-80.544,Earthquake,31.8,,,6.5,MW,...,,,,,,ISCGEM819518,ISCGEM,ISCGEM,ISCGEM,Automatic
3,08/01/1968,00:14:17,-26.822,-177.09,Earthquake,127.2,,,5.7,MW,...,,,,,,ISCGEM817534,ISCGEM,ISCGEM,ISCGEM,Automatic
4,08/01/1968,20:19:22,16.316,122.067,Earthquake,25.0,,,7.6,MW,...,,,,,,ISCGEM817557,ISCGEM,ISCGEM,ISCGEM,Automatic


# 1.1  Finding data type information about the dataset

The .dtypes attribute can be applied to a DataFrame, as shown, to see the data types of all its columns.

In [12]:
df1.dtypes

Date                           object
Time                           object
Latitude                      float64
Longitude                     float64
Type                           object
Depth                         float64
Depth Error                   float64
Depth Seismic Stations        float64
Magnitude                     float64
Magnitude Type                 object
Magnitude Error               float64
Magnitude Seismic Stations    float64
Azimuthal Gap                 float64
Horizontal Distance           float64
Horizontal Error              float64
Root Mean Square              float64
ID                             object
Source                         object
Location Source                object
Magnitude Source               object
Status                         object
dtype: object

As we can see, every column has either the object or the float64 data type.
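
For wide DataFrames, a per-type count can be easier to read than the full listing. The sketch below assumes a small stand-in DataFrame with a few of the columns shown above; `dtype_counts` is an illustrative name. The exact label reported for text columns may differ between pandas versions.

```python
import pandas as pd

# Small stand-in with a mix of text and numeric columns.
df = pd.DataFrame({
    "Date": ["07/29/1968", "07/30/1968"],
    "Latitude": [-22.531, -6.987],
    "Magnitude": [6.3, 6.5],
    "Type": ["Earthquake", "Earthquake"],
})

# Convert each dtype to its string name, then count how often each name occurs.
dtype_counts = df.dtypes.astype(str).value_counts()
print(dtype_counts)
```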

In [14]:
df1['Latitude'].dtypes

dtype('float64')

As shown above, .dtypes can also be applied to an individual column.
You can also use .info() to see all the data types under the header ‘Dtype’.

In [15]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Date                        1000 non-null   object 
 1   Time                        1000 non-null   object 
 2   Latitude                    1000 non-null   float64
 3   Longitude                   1000 non-null   float64
 4   Type                        1000 non-null   object 
 5   Depth                       1000 non-null   float64
 6   Depth Error                 4 non-null      float64
 7   Depth Seismic Stations      6 non-null      float64
 8   Magnitude                   1000 non-null   float64
 9   Magnitude Type              1000 non-null   object 
 10  Magnitude Error             3 non-null      float64
 11  Magnitude Seismic Stations  6 non-null      float64
 12  Azimuthal Gap               6 non-null      float64
 13  Horizontal Distance         5 non-

# 1.2  Converting from one data type to another

pandas allows us to convert from one data type to another using the .astype() method. Let us create a copy of our DataFrame named df2, and convert the data type of the Latitude column in df2 to int64.

In [17]:
df2= df1.copy()
df2['Latitude'] = df1['Latitude'].astype('int64')
df2.dtypes

Date                           object
Time                           object
Latitude                        int64
Longitude                     float64
Type                           object
Depth                         float64
Depth Error                   float64
Depth Seismic Stations        float64
Magnitude                     float64
Magnitude Type                 object
Magnitude Error               float64
Magnitude Seismic Stations    float64
Azimuthal Gap                 float64
Horizontal Distance           float64
Horizontal Error              float64
Root Mean Square              float64
ID                             object
Source                         object
Location Source                object
Magnitude Source               object
Status                         object
dtype: object

However, converting a floating-point value to an integer discards the decimal digits (the value is truncated toward zero, not rounded), so this conversion should only be used when that loss of information is acceptable.
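
The effect of the conversion can be seen on a few latitude values. This is a minimal sketch, assuming a short stand-in Series with values taken from the rows shown earlier; it contrasts a plain `.astype('int64')` with rounding first.

```python
import pandas as pd

# A few latitude values from the dataset preview.
latitudes = pd.Series([-22.531, -0.291, 16.316, 27.579])

# astype drops the fractional part, moving each value toward zero.
truncated = latitudes.astype("int64")

# Rounding first gives the nearest integer instead.
rounded = latitudes.round().astype("int64")

print(truncated.tolist())  # [-22, 0, 16, 27]
print(rounded.tolist())    # [-23, 0, 16, 28]
```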

# 1.3  Converting date time data

Let us again look at the data types for the DataFrame df1.
Notice how the ‘Date’ and ‘Time’ columns are showing ‘object’ as the data type.

In [18]:
df1.dtypes

Date                           object
Time                           object
Latitude                      float64
Longitude                     float64
Type                           object
Depth                         float64
Depth Error                   float64
Depth Seismic Stations        float64
Magnitude                     float64
Magnitude Type                 object
Magnitude Error               float64
Magnitude Seismic Stations    float64
Azimuthal Gap                 float64
Horizontal Distance           float64
Horizontal Error              float64
Root Mean Square              float64
ID                             object
Source                         object
Location Source                object
Magnitude Source               object
Status                         object
dtype: object

Date or time data existing within a DataFrame should be converted to the ‘datetime64’ data type. This helps us to work efficiently with date and time values and perform operations such as finding the number of days between two successive dates. In our dataset, the “Date” and “Time” columns exist as object data types. Let us try to find the number of days between the first two rows (index 0 and index 1).

In [19]:
df1['Date'][1]-df1['Date'][0]

TypeError: unsupported operand type(s) for -: 'str' and 'str'

This operation is not supported because the Date column has the object data type: the dates are being treated as string values, which do not support subtraction.
Let us now convert the Date column to the datetime64 data type. to_datetime is the pandas function for converting to the datetime data type.

In [21]:
df1['Date']= pd.to_datetime(df1['Date'], format= '%m/%d/%Y')
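
If some entries do not match the given format, `to_datetime` raises an error by default. The sketch below assumes a short stand-in Series with one deliberately malformed, hypothetical entry; it uses the standard `errors='coerce'` option, which turns unparseable values into NaT so they can be inspected afterwards.

```python
import pandas as pd

# Two valid dates from the preview plus one hypothetical malformed entry.
raw_dates = pd.Series(["07/29/1968", "08/01/1968", "not a date"])

# errors="coerce" replaces values that do not match the format with NaT.
parsed = pd.to_datetime(raw_dates, format="%m/%d/%Y", errors="coerce")

print(parsed)
print(parsed.isna().sum())  # number of entries that could not be parsed
```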

Let us again check whether the Date column has the correct data type by using .dtypes.

In [22]:
df1.dtypes

Date                          datetime64[ns]
Time                                  object
Latitude                             float64
Longitude                            float64
Type                                  object
Depth                                float64
Depth Error                          float64
Depth Seismic Stations               float64
Magnitude                            float64
Magnitude Type                        object
Magnitude Error                      float64
Magnitude Seismic Stations           float64
Azimuthal Gap                        float64
Horizontal Distance                  float64
Horizontal Error                     float64
Root Mean Square                     float64
ID                                    object
Source                                object
Location Source                       object
Magnitude Source                      object
Status                                object
dtype: object

Here we can see that the data type has changed to datetime64. Let us look at how the Date column has changed as a result of this operation. You will notice that the year now appears first, followed by the month and then the day.

In [23]:
df1.head(10)

Unnamed: 0,Date,Time,Latitude,Longitude,Type,Depth,Depth Error,Depth Seismic Stations,Magnitude,Magnitude Type,...,Magnitude Seismic Stations,Azimuthal Gap,Horizontal Distance,Horizontal Error,Root Mean Square,ID,Source,Location Source,Magnitude Source,Status
0,1968-07-29,11:12:02,-22.531,-174.86,Earthquake,45.0,,,6.3,MW,...,,,,,,ISCGEM819475,ISCGEM,ISCGEM,ISCGEM,Automatic
1,1968-07-29,23:52:18,-0.291,133.409,Earthquake,20.0,,,6.2,MW,...,,,,,,ISCGEM819494,ISCGEM,ISCGEM,ISCGEM,Automatic
2,1968-07-30,20:38:43,-6.987,-80.544,Earthquake,31.8,,,6.5,MW,...,,,,,,ISCGEM819518,ISCGEM,ISCGEM,ISCGEM,Automatic
3,1968-08-01,00:14:17,-26.822,-177.09,Earthquake,127.2,,,5.7,MW,...,,,,,,ISCGEM817534,ISCGEM,ISCGEM,ISCGEM,Automatic
4,1968-08-01,20:19:22,16.316,122.067,Earthquake,25.0,,,7.6,MW,...,,,,,,ISCGEM817557,ISCGEM,ISCGEM,ISCGEM,Automatic
5,1968-08-02,13:30:25,27.579,60.979,Earthquake,67.4,,,5.9,MW,...,,,,,,ISCGEM817580,ISCGEM,ISCGEM,ISCGEM,Automatic
6,1968-08-02,14:06:43,16.519,-97.739,Earthquake,25.0,,,7.3,MW,...,,,,,,ISCGEM817583,ISCGEM,ISCGEM,ISCGEM,Automatic
7,1968-08-03,04:54:34,25.637,128.49,Earthquake,15.0,,,6.7,MW,...,,,,,,ISCGEM817611,ISCGEM,ISCGEM,ISCGEM,Automatic
8,1968-08-03,06:25:05,16.475,122.23,Earthquake,25.0,,,6.4,MW,...,,,,,,ISCGEM817612,ISCGEM,ISCGEM,ISCGEM,Automatic
9,1968-08-04,11:41:25,6.498,126.777,Earthquake,100.0,,,6.1,MW,...,,,,,,ISCGEM817656,ISCGEM,ISCGEM,ISCGEM,Automatic


Let us now find the difference between two dates in the dataset and look at the result.

In [24]:
df1['Date'][1]-df1['Date'][0]

Timedelta('0 days 00:00:00')

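You can now see that the difference between the two dates is returned. Here both earthquakes occurred on the same date, so the difference is 0 days. The result also includes an hours:minutes:seconds component, which holds any remaining difference smaller than a day.

A Timedelta represents a duration, that is, the difference between two points in time. You can now try to convert the ‘Time’ column to the ‘datetime64’ data type.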
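
One possible way to make use of the ‘Time’ column is to combine it with ‘Date’ into a single datetime value. The following is a minimal sketch, assuming a stand-in DataFrame that still holds both columns as strings; the column name `Timestamp` and the variable `gap` are hypothetical and not part of the dataset.

```python
import pandas as pd

# The first three rows of the preview, with Date and Time still stored as text.
df = pd.DataFrame({
    "Date": ["07/29/1968", "07/29/1968", "07/30/1968"],
    "Time": ["11:12:02", "23:52:18", "20:38:43"],
})

# Join the two strings and parse them in one step.
df["Timestamp"] = pd.to_datetime(
    df["Date"] + " " + df["Time"], format="%m/%d/%Y %H:%M:%S"
)

# The difference now includes hours, minutes and seconds.
gap = df["Timestamp"].iloc[1] - df["Timestamp"].iloc[0]
print(gap)                  # 0 days 12:40:16
print(gap.total_seconds())  # 45616.0
```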

# 1.3.1  Extracting further information from the datetime64 data type

In [26]:
df1['Date'][1]

Timestamp('1968-07-29 00:00:00')

In [27]:
df1['Date'][1].day

29

In [28]:
df1['Date'][1].month

7

In [29]:
df1['Date'][1].year

1968
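
The same attributes are available for a whole column at once through the `.dt` accessor. This is a minimal sketch, assuming a short stand-in Series of dates from the preview; the DataFrame `parts` and its column names are illustrative.

```python
import pandas as pd

# A few dates from the preview, parsed with the same format as above.
dates = pd.to_datetime(
    pd.Series(["07/29/1968", "08/01/1968", "08/04/1968"]), format="%m/%d/%Y"
)

# .dt exposes the datetime components for every row of the Series.
parts = pd.DataFrame({
    "year": dates.dt.year,
    "month": dates.dt.month,
    "day": dates.dt.day,
    "weekday": dates.dt.day_name(),
})
print(parts)
```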

# 2  Selecting columns based on data types

pandas allows you to select columns based on their data types by using the select_dtypes function. For example, we can choose the columns with data type ‘float’ to create a new DataFrame out of the existing DataFrame. Let us revisit the data types for df1.

In [30]:
df1.dtypes

Date                          datetime64[ns]
Time                                  object
Latitude                             float64
Longitude                            float64
Type                                  object
Depth                                float64
Depth Error                          float64
Depth Seismic Stations               float64
Magnitude                            float64
Magnitude Type                        object
Magnitude Error                      float64
Magnitude Seismic Stations           float64
Azimuthal Gap                        float64
Horizontal Distance                  float64
Horizontal Error                     float64
Root Mean Square                     float64
ID                                    object
Source                                object
Location Source                       object
Magnitude Source                      object
Status                                object
dtype: object

Now let us create a new DataFrame named decimals which contains the columns from df1 with the data type float64.

In [32]:
decimals= df1.select_dtypes('float')
decimals.head()

Unnamed: 0,Latitude,Longitude,Depth,Depth Error,Depth Seismic Stations,Magnitude,Magnitude Error,Magnitude Seismic Stations,Azimuthal Gap,Horizontal Distance,Horizontal Error,Root Mean Square
0,-22.531,-174.86,45.0,,,6.3,,,,,,
1,-0.291,133.409,20.0,,,6.2,,,,,,
2,-6.987,-80.544,31.8,,,6.5,,,,,,
3,-26.822,-177.09,127.2,,,5.7,,,,,,
4,16.316,122.067,25.0,,,7.6,,,,,,


Check the data type for the newly created DataFrame.

In [33]:
decimals.dtypes

Latitude                      float64
Longitude                     float64
Depth                         float64
Depth Error                   float64
Depth Seismic Stations        float64
Magnitude                     float64
Magnitude Error               float64
Magnitude Seismic Stations    float64
Azimuthal Gap                 float64
Horizontal Distance           float64
Horizontal Error              float64
Root Mean Square              float64
dtype: object

Similarly, you may choose to exclude certain data types while working with DataFrames. For example, let us create a DataFrame number_data, which does not have any column from df1 with the object data type.

In [34]:
number_data = df1.select_dtypes(exclude= 'object')

Check the column and data type details for number_data.

In [36]:
number_data.head()

Unnamed: 0,Date,Latitude,Longitude,Depth,Depth Error,Depth Seismic Stations,Magnitude,Magnitude Error,Magnitude Seismic Stations,Azimuthal Gap,Horizontal Distance,Horizontal Error,Root Mean Square
0,1968-07-29,-22.531,-174.86,45.0,,,6.3,,,,,,
1,1968-07-29,-0.291,133.409,20.0,,,6.2,,,,,,
2,1968-07-30,-6.987,-80.544,31.8,,,6.5,,,,,,
3,1968-08-01,-26.822,-177.09,127.2,,,5.7,,,,,,
4,1968-08-01,16.316,122.067,25.0,,,7.6,,,,,,


In [37]:
number_data.dtypes

Date                          datetime64[ns]
Latitude                             float64
Longitude                            float64
Depth                                float64
Depth Error                          float64
Depth Seismic Stations               float64
Magnitude                            float64
Magnitude Error                      float64
Magnitude Seismic Stations           float64
Azimuthal Gap                        float64
Horizontal Distance                  float64
Horizontal Error                     float64
Root Mean Square                     float64
dtype: object
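
select_dtypes also accepts the generic selector `'number'` and a list of types. The sketch below assumes a small stand-in DataFrame; the integer column `Depth Bin` is hypothetical and only added so that more than one numeric type is present.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["07/29/1968", "07/30/1968"], format="%m/%d/%Y"),
    "Latitude": [-22.531, -6.987],
    "Magnitude": [6.3, 6.5],
    "Depth Bin": [45, 31],  # hypothetical integer column
    "Type": ["Earthquake", "Earthquake"],
})

# 'number' matches every numeric type (integers and floats).
numeric = df.select_dtypes(include="number")

# A list selects several data types at once.
floats_and_dates = df.select_dtypes(include=["float64", "datetime64"])

print(list(numeric.columns))
print(list(floats_and_dates.columns))
```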

# 3  Additional tasks

Let us look at some additional ways of working with data types.

## Changing data types while importing data

Data types in a DataFrame can be customised at the time of reading the data by passing the dtype argument to read_csv, as shown. This can be applied to multiple columns at once by passing each column name and its data type as a key-value pair in a dictionary.

In [40]:
dtypes_dict = {"Depth": "object"}

In [41]:
df3= pd.read_csv('significant_earthquakes.csv', dtype=dtypes_dict)

In [42]:
df3.dtypes

Unnamed: 0                      int64
Date                           object
Time                           object
Latitude                      float64
Longitude                     float64
Type                           object
Depth                          object
Depth Error                   float64
Depth Seismic Stations        float64
Magnitude                     float64
Magnitude Type                 object
Magnitude Error               float64
Magnitude Seismic Stations    float64
Azimuthal Gap                 float64
Horizontal Distance           float64
Horizontal Error              float64
Root Mean Square              float64
ID                             object
Source                         object
Location Source                object
Magnitude Source               object
Status                         object
dtype: object
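
A dictionary with more than one entry works the same way. This is a minimal sketch, assuming a two-row stand-in CSV held in memory instead of the real file; the choice of `float32` for both columns is only an illustration.

```python
import io
import pandas as pd

# In-memory stand-in for the CSV file, using values from the preview.
csv_text = """Date,Depth,Magnitude,Type
07/29/1968,45.0,6.3,Earthquake
07/29/1968,20.0,6.2,Earthquake
"""

# One key-value pair per column whose data type should be set while reading.
dtypes_dict = {"Depth": "float32", "Magnitude": "float32"}
df = pd.read_csv(io.StringIO(csv_text), dtype=dtypes_dict)

print(df.dtypes)
```

Columns that are not listed in the dictionary keep the data type that pandas infers for them.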

Data types also influence the memory usage of our dataset. Let us see the memory being consumed by our dataset.

In [43]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 22 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Unnamed: 0                  1000 non-null   int64  
 1   Date                        1000 non-null   object 
 2   Time                        1000 non-null   object 
 3   Latitude                    1000 non-null   float64
 4   Longitude                   1000 non-null   float64
 5   Type                        1000 non-null   object 
 6   Depth                       1000 non-null   object 
 7   Depth Error                 4 non-null      float64
 8   Depth Seismic Stations      6 non-null      float64
 9   Magnitude                   1000 non-null   float64
 10  Magnitude Type              1000 non-null   object 
 11  Magnitude Error             3 non-null      float64
 12  Magnitude Seismic Stations  6 non-null      float64
 13  Azimuthal Gap               6 non-

# 3.1  Category data type to save memory

The ‘category’ data type can be used for columns containing categorical data. Usually, categorical columns are read in with the ‘object’ data type. The data type for such columns can be changed to ‘category’, as shown, to save memory. You can compare the memory usage of the dataset before and after type conversion to ‘category’.

In [46]:
dtypes1= {"Type":"category",
         "Status":"category",
         "Source":"category",
         "Location Source":"category",
         "Magnitude Source":"category",
         "Magnitude Type":"category"}

df4= pd.read_csv('significant_earthquakes.csv',dtype=dtypes1, parse_dates=['Date','Time'])

In [47]:
df4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 22 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   Unnamed: 0                  1000 non-null   int64         
 1   Date                        1000 non-null   datetime64[ns]
 2   Time                        1000 non-null   datetime64[ns]
 3   Latitude                    1000 non-null   float64       
 4   Longitude                   1000 non-null   float64       
 5   Type                        1000 non-null   category      
 6   Depth                       1000 non-null   float64       
 7   Depth Error                 4 non-null      float64       
 8   Depth Seismic Stations      6 non-null      float64       
 9   Magnitude                   1000 non-null   float64       
 10  Magnitude Type              1000 non-null   category      
 11  Magnitude Error             3 non-null      float64      

The memory usage reduces from 172 KB to 131.5+ KB. Therefore, it is important to have the right kind of data type for each column.
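
The saving can also be measured for a single column. The following is a minimal sketch, assuming a stand-in Series of 1000 repeated labels; the label ‘Reviewed’ is a hypothetical second value. It uses `memory_usage(deep=True)`, which also counts the memory held by the strings themselves.

```python
import pandas as pd

# 1000 values but only 2 distinct labels ("Reviewed" is a hypothetical label).
status = pd.Series(["Automatic", "Reviewed"] * 500)
as_category = status.astype("category")

# deep=True includes the memory used by the string objects themselves.
text_bytes = status.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(text_bytes, category_bytes)
```

The fewer distinct values a column has relative to its length, the larger the saving tends to be. For a whole DataFrame, `df.info(memory_usage='deep')` reports a total that is measured the same way.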

Congratulations on finishing this demo! You should now be able to work with different data types while using pandas.