# Fixing `cyl` Data Type
- 2008: extract int from string
- 2018: convert float to int

Load datasets `data_08_v2.csv` and `data_18_v2.csv`. You should've created these data files in the previous section: *Filter, Drop Nulls, Dedupe*.

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

%config IPCompleter.greedy = True

In [2]:
import pandas as pd
import numpy as np

In [3]:
# load datasets
df_08 = pd.read_csv('data_08_v2.csv')
df_08.head(n=2)

Unnamed: 0,model,displ,cyl,trans,drive,fuel,veh_class,air_pollution_score,city_mpg,hwy_mpg,cmb_mpg,greenhouse_gas_score,smartway
0,ACURA MDX,3.7,(6 cyl),Auto-S5,4WD,Gasoline,SUV,7,15,20,17,4,no
1,ACURA RDX,2.3,(4 cyl),Auto-S5,4WD,Gasoline,SUV,7,17,22,19,5,no


In [4]:
df_18 = pd.read_csv('data_18_v2.csv')
df_18.head(n=2)

Unnamed: 0,model,displ,cyl,trans,drive,fuel,veh_class,air_pollution_score,city_mpg,hwy_mpg,cmb_mpg,greenhouse_gas_score,smartway
0,ACURA RDX,3.5,6.0,SemiAuto-6,2WD,Gasoline,small SUV,3,20,28,23,5,No
1,ACURA RDX,3.5,6.0,SemiAuto-6,4WD,Gasoline,small SUV,3,19,27,22,4,No


> #### Fix cyl datatype

##### Step: 1 - 2008: extract int from string.

In [5]:
# check value counts for the 2008 cyl column
df_08['cyl'].value_counts()

(6 cyl)     409
(4 cyl)     283
(8 cyl)     199
(5 cyl)      48
(12 cyl)     30
(10 cyl)     14
(2 cyl)       2
(16 cyl)      1
Name: cyl, dtype: int64

In [6]:
# Check the 2008 cyl column for null values
df_08_cyl_null = df_08['cyl'].isnull()
df_08_cyl_null

# Count the null values
df_08_cyl_null.sum()

0      False
1      False
2      False
3      False
4      False
5      False
6      False
7      False
8      False
9      False
10     False
11     False
12     False
13     False
14     False
15     False
16     False
17     False
18     False
19     False
20     False
21     False
22     False
23     False
24     False
25     False
26     False
27     False
28     False
29     False
       ...  
956    False
957    False
958    False
959    False
960    False
961    False
962    False
963    False
964    False
965    False
966    False
967    False
968    False
969    False
970    False
971    False
972    False
973    False
974    False
975    False
976    False
977    False
978    False
979    False
980    False
981    False
982    False
983    False
984    False
985    False
Name: cyl, Length: 986, dtype: bool

0

Read [this Stack Overflow question](https://stackoverflow.com/questions/35376387/extract-int-from-string-in-pandas) for help extracting ints from strings in Pandas in the next step.
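As an alternative to slicing by position, the digits can be pulled out with a regular expression. The snippet below is a minimal sketch on a small made-up Series (the sample values and the names `cyl` and `cyl_int` are assumptions for illustration, not the real dataset); it uses `Series.str.extract` with `expand=False` so the result is a Series rather than a DataFrame.

```python
import pandas as pd

# Hypothetical sample mirroring the "(N cyl)" format of the 2008 cyl column
cyl = pd.Series(['(6 cyl)', '(4 cyl)', '(12 cyl)'], name='cyl')

# Capture the first run of digits in each string, then cast to int64
cyl_int = cyl.str.extract(r'(\d+)', expand=False).astype('int64')

print(cyl_int.tolist())
print(cyl_int.dtype)
```

A pattern-based extraction does not depend on the exact number of characters around the digits, which may make it more forgiving than a fixed slice if the formatting varies.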

In [7]:
# Extract int from strings in the 2008 cyl column

# type of a cyl value before the extraction
type(df_08.loc[0,'cyl'])

# slice off the leading '(' and the trailing 'cyl)', then cast to int64
df_08['cyl'] =  df_08['cyl'].str[1:-4].astype('int64')

# type of a cyl value after the extraction
type(df_08.loc[0,'cyl'])

str

numpy.int64

In [8]:
# Check value counts for 2008 cyl column again to confirm the change
df_08['cyl'].value_counts()

6     409
4     283
8     199
5      48
12     30
10     14
2       2
16      1
Name: cyl, dtype: int64

##### Step: 2 - 2018: convert float to int.

In [9]:
# Check the value and type of the first entry in the 2018 cyl column
cyl_18_first = df_18.loc[0, 'cyl']
cyl_18_first
type(cyl_18_first)

6.0

numpy.float64

In [10]:
# Sum of null values in df_18['cyl']
df_18['cyl'].isnull().sum()

0
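The null check matters because a float column containing `NaN` cannot be cast to `int64`. The following is a small sketch on made-up Series (`whole` and `with_nan` are hypothetical names and values, not the real data) showing the cast succeeding without nulls, failing with one, and pandas' nullable `Int64` dtype as one possible fallback.

```python
import numpy as np
import pandas as pd

# Hypothetical samples: one without nulls, one with a missing value
whole = pd.Series([6.0, 4.0, 8.0], name='cyl')
with_nan = pd.Series([6.0, np.nan, 4.0], name='cyl')

# No nulls: the float-to-int cast works directly
whole_int = whole.astype('int64')

# With a null: the cast raises an error (a ValueError or a subclass of it)
try:
    with_nan.astype('int64')
    cast_failed = False
except ValueError:
    cast_failed = True

# The nullable integer dtype keeps the missing value as <NA>
nullable = with_nan.astype('Int64')

print(whole_int.tolist(), cast_failed, nullable.dtype)
```

Since the 2018 `cyl` column has no nulls, the plain `int64` cast is enough here.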

In [11]:
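# Convert the 2018 cyl column from float to int and assign the result back,
# otherwise the change is not kept in df_18
df_18['cyl'] = df_18['cyl'].astype('int64')
df_18['cyl']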

0      6
1      6
2      4
3      6
4      6
5      6
6      6
7      4
8      6
9      4
10     4
11     4
12     4
13     4
14     4
15     4
16     4
17     4
18     4
19     4
20     4
21     4
22     6
23     6
24     6
25     4
26     4
27     4
28     4
29     6
      ..
764    6
765    4
766    4
767    4
768    4
769    4
770    4
771    4
772    4
773    4
774    4
775    4
776    4
777    4
778    4
779    4
780    4
781    4
782    4
783    4
784    4
785    4
786    4
787    4
788    4
789    4
790    4
791    4
792    4
793    4
Name: cyl, Length: 794, dtype: int64

In [12]:
# Save the updated datasets without writing the row index as a column
df_08.to_csv('data_08_v3.csv', index=False)
df_18.to_csv('data_18_v3.csv', index=False)
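The `Unnamed: 0` column seen when loading the `_v2` files is what typically appears when a CSV was written with its index and then read back. The sketch below illustrates the effect of `index=False` using an in-memory buffer instead of real files; the tiny DataFrame and all variable names are assumptions for illustration only.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the cleaned data
df = pd.DataFrame({'model': ['ACURA MDX', 'ACURA RDX'], 'cyl': [6, 4]})

# Written with the index (the default): reading back adds "Unnamed: 0"
buf_with_index = io.StringIO()
df.to_csv(buf_with_index)
buf_with_index.seek(0)
reloaded_with_index = pd.read_csv(buf_with_index)

# Written with index=False: the columns round-trip unchanged
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
reloaded = pd.read_csv(buf)

print(list(reloaded_with_index.columns))
print(list(reloaded.columns), reloaded['cyl'].dtype)
```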