### This Jupyter notebook contains the content of lesson 2, "EDA - Data Cleaning Part2"

**Author : Umidjon Sattorov. Machine Learning engineer**

## Type casting

In [1]:
# Importing necessary libraries
import pandas as pd 

Let's create a dataset for two people containing their weight and height. The weight values are intentionally stored as strings.

In [2]:
df = pd.DataFrame(data={'weight': ['64', '75'], 'height': [160, 185]})
df

Unnamed: 0,weight,height
0,64,160
1,75,185


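Because the weight column contains strings, arithmetic operations on it do not behave as expected: summing the column concatenates the strings instead of adding numbers, while the integer height column sums normally.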

In [3]:
df['weight'].sum()

'6475'

In [4]:
df['height'].sum()

np.int64(345)

Let's check the data types of the columns.

In [5]:
df.dtypes

weight    object
height     int64
dtype: object

From the output above, we can see that the weight column holds strings, which pandas reports as the object dtype. To convert the column to a numeric type, we can use the function pd.to_numeric().

In [6]:
df['weight'] = pd.to_numeric(df['weight'])

In [7]:
# Let's check it again
df['weight'].dtype

dtype('int64')

In [8]:
df['weight'].sum()

np.int64(139)
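
Real columns often contain values that cannot be parsed as numbers. The sketch below shows one way to handle that with the `errors='coerce'` option of `pd.to_numeric()`. The `raw` series and its `'unknown'` entry are made up for illustration and are not part of the lesson dataset.

```python
import pandas as pd

# Hypothetical column with one value that cannot be parsed as a number
raw = pd.Series(['64', '75', 'unknown'])

# errors='coerce' replaces unparseable entries with NaN instead of raising
weights = pd.to_numeric(raw, errors='coerce')

print(weights.dtype)         # float64, because NaN forces a float column
print(weights.isna().sum())  # number of values that failed to parse
print(weights.sum())         # NaN values are skipped by default
```

Counting the NaN values after coercion is a quick way to see how many entries need manual attention before continuing with the analysis.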

Let's move on to a real-life example. Load the cleaned dataframe that was saved in the previous lesson.

In [9]:
df_clean = pd.read_csv('./dataset/df_clean.csv')
df_clean.head()

Unnamed: 0,id,url,region,region_url,price,year,manufacturer,model,fuel,odometer,title_status,transmission,image_url,description,state,lat,long,posting_date,price_category
0,7308295377,https://chattanooga.craigslist.org/ctd/d/chatt...,chattanooga,https://chattanooga.craigslist.org,54990,2020.0,ram,2500 crew cab big horn,diesel,27442.0,clean,other,https://images.craigslist.org/00N0N_1xMPvfxRAI...,Carvana is the safer way to buy a car During t...,tn,35.06,-85.25,2021-04-17T12:30:50-0400,high
1,7316380095,https://newjersey.craigslist.org/ctd/d/carlsta...,north jersey,https://newjersey.craigslist.org,16942,2016.0,ford,explorer 4wd 4dr xlt,,60023.0,clean,automatic,https://images.craigslist.org/00x0x_26jl9F0cnL...,***Call Us for more information at: 201-635-14...,nj,40.821805,-74.061962,2021-05-03T15:40:21-0400,medium
2,7313733749,https://reno.craigslist.org/ctd/d/atlanta-2017...,reno / tahoe,https://reno.craigslist.org,35590,2017.0,volkswagen,golf r hatchback,gas,14048.0,clean,other,https://images.craigslist.org/00y0y_eeZjWeiSfb...,Carvana is the safer way to buy a car During t...,ca,33.779214,-84.411811,2021-04-28T03:52:20-0700,high
3,7308210929,https://fayetteville.craigslist.org/ctd/d/rale...,fayetteville,https://fayetteville.craigslist.org,14500,2013.0,toyota,rav4,gas,117291.0,clean,automatic,https://images.craigslist.org/00606_iGe5iXidib...,2013 Toyota RAV4 XLE 4dr SUV Offered by: R...,nc,35.715954,-78.655304,2021-04-17T10:08:57-0400,medium
4,7316474668,https://newyork.craigslist.org/lgi/cto/d/baldw...,new york city,https://newyork.craigslist.org,21800,2021.0,nissan,altima,gas,8000.0,clean,automatic,https://images.craigslist.org/00V0V_3pSOiPZ3Sd...,2021 Nissan Altima Sv with Only 8 K Miles Titl...,ny,40.6548,-73.6097,2021-05-03T18:32:06-0400,medium


In [10]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 19 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              10000 non-null  int64  
 1   url             10000 non-null  object 
 2   region          10000 non-null  object 
 3   region_url      10000 non-null  object 
 4   price           10000 non-null  int64  
 5   year            9964 non-null   float64
 6   manufacturer    10000 non-null  object 
 7   model           9872 non-null   object 
 8   fuel            9937 non-null   object 
 9   odometer        10000 non-null  float64
 10  title_status    9834 non-null   object 
 11  transmission    9955 non-null   object 
 12  image_url       9998 non-null   object 
 13  description     9998 non-null   object 
 14  state           10000 non-null  object 
 15  lat             9902 non-null   float64
 16  long            9902 non-null   float64
 17  posting_date    9998 non-null   

In [11]:
df_types = df_clean.copy()

In [12]:
df_types['odometer'] = df_types['odometer'].astype(int)

In [13]:
df_types['odometer'].dtype

dtype('int64')

In [14]:
df_types['odometer']

0        27442
1        60023
2        14048
3       117291
4         8000
         ...  
9995    150000
9996    113573
9997    150184
9998     61943
9999     35921
Name: odometer, Length: 10000, dtype: int64

In [15]:
df_types['odometer'].astype('string')[0]

'27442'
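
The `astype(int)` cast works for odometer because that column has no missing values. A column such as year, which does contain missing values according to `df_clean.info()`, would likely need a different approach. The sketch below uses the nullable `'Int64'` dtype on a small made-up series; the values are illustrative and not read from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical values mimicking a float column with one missing entry
year = pd.Series([2020.0, 2016.0, np.nan])

# year.astype(int) would raise here, because NaN has no integer representation.
# The nullable integer dtype (capital "I") keeps the gap as <NA> instead.
year_int = year.astype('Int64')

print(year_int.dtype)
print(year_int.isna().sum())  # the missing entry is preserved
```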

In [16]:
# Problem with the datetime column: the values are stored as strings
df_types['posting_date'].values.tolist()

['2021-04-17T12:30:50-0400',
 '2021-05-03T15:40:21-0400',
 '2021-04-28T03:52:20-0700',
 '2021-04-17T10:08:57-0400',
 '2021-05-03T18:32:06-0400',
 '2021-04-08T15:10:56-0400',
 '2021-05-04T11:59:42-0500',
 '2021-04-23T19:34:13-0400',
 '2021-04-30T17:20:30-0400',
 '2021-04-27T21:14:22-0500',
 '2021-04-30T15:20:33-0400',
 '2021-05-03T21:03:32-0400',
 '2021-04-21T13:03:14-0400',
 '2021-05-04T20:22:11-0700',
 '2021-04-08T14:36:36-0700',
 '2021-05-04T11:01:47-1000',
 '2021-04-28T19:09:23-0500',
 '2021-04-19T10:42:16-0600',
 '2021-04-23T11:02:25-0700',
 '2021-04-29T14:06:55-0400',
 '2021-05-01T15:46:41-0700',
 '2021-04-25T10:10:43-0500',
 '2021-04-25T13:28:23-0700',
 '2021-04-21T11:06:50-0400',
 '2021-05-03T13:43:57-0700',
 '2021-04-22T08:02:36-0400',
 '2021-04-11T12:11:15-0400',
 '2021-04-23T14:27:25-0500',
 '2021-04-16T19:10:16-0400',
 '2021-04-08T15:16:13-0500',
 '2021-04-22T09:16:38-0500',
 '2021-04-28T19:40:08-0700',
 '2021-04-20T10:40:15-0700',
 '2021-04-09T19:46:04-0700',
 '2021-04-22T1

In [17]:
df_types['posting_date'][0].day

AttributeError: 'str' object has no attribute 'day'

In [18]:
pd.to_datetime(df_types['posting_date'])

0       2021-04-17 12:30:50-04:00
1       2021-05-03 15:40:21-04:00
2       2021-04-28 03:52:20-07:00
3       2021-04-17 10:08:57-04:00
4       2021-05-03 18:32:06-04:00
                  ...            
9995    2021-04-10 16:33:57-04:00
9996    2021-05-03 09:36:30-04:00
9997    2021-04-22 12:14:01-07:00
9998    2021-04-14 09:14:42-05:00
9999    2021-04-24 13:50:49-04:00
Name: posting_date, Length: 10000, dtype: object

In [19]:
pd.to_datetime(df_types['posting_date'], utc=True)

0      2021-04-17 16:30:50+00:00
1      2021-05-03 19:40:21+00:00
2      2021-04-28 10:52:20+00:00
3      2021-04-17 14:08:57+00:00
4      2021-05-03 22:32:06+00:00
                  ...           
9995   2021-04-10 20:33:57+00:00
9996   2021-05-03 13:36:30+00:00
9997   2021-04-22 19:14:01+00:00
9998   2021-04-14 14:14:42+00:00
9999   2021-04-24 17:50:49+00:00
Name: posting_date, Length: 10000, dtype: datetime64[ns, UTC]
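
The posting dates carry different UTC offsets, and passing `utc=True` converts them all to a single timezone so the column gets a proper datetime dtype. As a minimal sketch, the example below repeats this on three timestamps taken from the list above and then converts the result to a local timezone. The choice of `'America/New_York'` is only an illustration.

```python
import pandas as pd

# Three timestamps from the dataset, each with a different UTC offset
stamps = pd.Series([
    '2021-04-17T12:30:50-0400',
    '2021-04-28T03:52:20-0700',
    '2021-05-04T11:59:42-0500',
])

# utc=True shifts every value to UTC, giving one timezone-aware column
utc = pd.to_datetime(stamps, utc=True)
print(utc.dtype)
print(utc[1])  # 2021-04-28 10:52:20+00:00

# The UTC column can be converted to any single timezone afterwards
eastern = utc.dt.tz_convert('America/New_York')
print(eastern[0])  # 2021-04-17 12:30:50-04:00
```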

In [20]:
df_types['date'] = pd.to_datetime(df_types['posting_date'], utc=True)

In [21]:
df_types['date']

0      2021-04-17 16:30:50+00:00
1      2021-05-03 19:40:21+00:00
2      2021-04-28 10:52:20+00:00
3      2021-04-17 14:08:57+00:00
4      2021-05-03 22:32:06+00:00
                  ...           
9995   2021-04-10 20:33:57+00:00
9996   2021-05-03 13:36:30+00:00
9997   2021-04-22 19:14:01+00:00
9998   2021-04-14 14:14:42+00:00
9999   2021-04-24 17:50:49+00:00
Name: date, Length: 10000, dtype: datetime64[ns, UTC]

In [22]:
df_types['date'][0].day

17
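
Accessing `.day` on a single element works, but for a whole column the `.dt` accessor is usually more convenient. The sketch below extracts several date parts at once from two timestamps taken from the dataset; the `parts` dataframe and its column names are chosen here for illustration.

```python
import pandas as pd

# Two posting dates from the dataset, parsed as UTC datetimes
dates = pd.to_datetime(pd.Series([
    '2021-04-17T12:30:50-0400',
    '2021-05-03T15:40:21-0400',
]), utc=True)

# The .dt accessor applies datetime attributes to every row at once
parts = pd.DataFrame({
    'day': dates.dt.day,
    'month': dates.dt.month,
    'weekday': dates.dt.day_name(),
})
print(parts)
```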

In [24]:
df_types.to_csv('./dataset/df_types.csv', index=False)
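
One thing to keep in mind is that CSV files store only text, so a datetime column is not restored automatically when the file is read back. The sketch below shows the round trip on a small made-up dataframe, using an in-memory buffer instead of a file, and assumes that `parse_dates` in `pd.read_csv()` is an acceptable way to recover the column.

```python
import io
import pandas as pd

# Small stand-in for df_types with an integer and a datetime column
df = pd.DataFrame({
    'odometer': [27442, 60023],
    'date': pd.to_datetime(
        ['2021-04-17T12:30:50-0400', '2021-05-03T15:40:21-0400'], utc=True
    ),
})

# Write to an in-memory buffer rather than a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Reading without parse_dates leaves the dates as text
buffer.seek(0)
plain = pd.read_csv(buffer)

# Reading with parse_dates converts the column back to datetimes
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=['date'])

print(plain['date'].dtype)
print(restored['date'].dtype)
```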