<h1><center>Neural networks with Keras and Scikit</center></h1>


<center><i>Linear regression: Predicting the best price for car sales</i></center>

In [1]:
import pandas as pd
import numpy as np
import io
from google.colab import files

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
%cd '/content/drive/My Drive/Colab Notebooks/curso-redes-neuronales/proyecto-del-curso'
%ls

/content/drive/My Drive/Colab Notebooks/curso-redes-neuronales/proyecto-del-curso
 cars.parquet
'Copia de Proyecto_Precios_Vehiculos_Usados_BLANCO.ipynb'
 craiglist_cars.csv
 data-engineer.ipynb
 design-training-and-evaluation


# <h1 id="exploration">Data exploration</h1>


In [4]:
cars = pd.read_csv('craiglist_cars.zip')
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 434542 entries, 0 to 434541
Data columns (total 12 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   year          434542 non-null  int64 
 1   manufacturer  418698 non-null  object
 2   condition     274367 non-null  object
 3   cylinders     311539 non-null  object
 4   fuel          430894 non-null  object
 5   title_status  431661 non-null  object
 6   transmission  430244 non-null  object
 7   drive         376834 non-null  object
 8   size          181927 non-null  object
 9   type          384280 non-null  object
 10  paint_color   348787 non-null  object
 11  price         434542 non-null  int64 
dtypes: int64(2), object(10)
memory usage: 39.8+ MB


In [5]:
cars.describe()

               year         price
count  434542.0      434542.0
mean     2008.988484  12082.363571
std         9.120415  10346.560358
min      1900.0           0.0
25%      2006.0        3999.0
50%      2011.0        9495.0
75%      2015.0       17881.0
max      2020.0       50934.0


In [6]:
cars.shape

(434542, 12)

# Handling NaN values

In [7]:
# Percentage of NaN values per column
nan_percent_values = dict(cars.isna().sum()/cars.shape[0] * 100)
nan_percent_values

{'condition': 36.860648682981164,
 'cylinders': 28.3063547367113,
 'drive': 13.280189256734676,
 'fuel': 0.8395045818355878,
 'manufacturer': 3.6461377726433803,
 'paint_color': 19.73457111165319,
 'price': 0.0,
 'size': 58.13362114594216,
 'title_status': 0.6629969024858356,
 'transmission': 0.9890873609455472,
 'type': 11.566660990191972,
 'year': 0.0}

In this case, NaN values are replaced by the column mode only in columns where fewer than 10% of the values are missing. Columns with a higher share of NaN values are left unchanged.

In [12]:
def nan_to_mode(df, nan_values_dict):
  # Iterate over (column, percentage) pairs directly; looking a column up by
  # its percentage would pick the wrong one if two columns shared a value.
  # Note: df is modified in place and also returned.
  for column, percent in nan_values_dict.items():
    if 0 < percent < 10:
      print(f"'{column}' has {round(percent,2)}% of NaN values, I will replace it by the mode. \n")
      df[column] = df[column].fillna(df[column].mode()[0])
  return df

In [13]:
working_df = nan_to_mode(cars, nan_percent_values)

'manufacturer' has 3.65% of NaN values, I will replace it by the mode. 

'fuel' has 0.84% of NaN values, I will replace it by the mode. 

'title_status' has 0.66% of NaN values, I will replace it by the mode. 

'transmission' has 0.99% of NaN values, I will replace it by the mode. 



In [15]:
working_df.isna().sum()/working_df.shape[0] * 100

year             0.000000
manufacturer     0.000000
condition       36.860649
cylinders       28.306355
fuel             0.000000
title_status     0.000000
transmission     0.000000
drive           13.280189
size            58.133621
type            11.566661
paint_color     19.734571
price            0.000000
dtype: float64
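
The imputation rule above can also be written without the intermediate dictionary. The following is a minimal sketch on toy data, not the notebook's own code: the function name `impute_low_nan_with_mode` and its `threshold` parameter are hypothetical, and unlike `nan_to_mode` it returns a copy instead of modifying the input frame.

```python
import numpy as np
import pandas as pd


def impute_low_nan_with_mode(df, threshold=10.0):
    """Fill NaN with the column mode where 0 < NaN% < threshold (hypothetical helper)."""
    out = df.copy()
    nan_percent = out.isna().mean() * 100  # percentage of NaN values per column
    for column, percent in nan_percent.items():
        if 0 < percent < threshold:
            out[column] = out[column].fillna(out[column].mode()[0])
    return out


# Toy data (made up): 'fuel' is 5% NaN and gets imputed,
# 'size' is 50% NaN and is left untouched.
toy = pd.DataFrame({
    "fuel": ["gas"] * 15 + ["diesel"] * 4 + [np.nan],
    "size": ["compact"] * 10 + [np.nan] * 10,
})
filled = impute_low_nan_with_mode(toy)
print(filled.isna().sum())
```

Returning a copy keeps the raw data available for comparison, at the cost of extra memory on a frame of this size.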

# Data types and transformations

In [16]:
types = pd.DataFrame(data=working_df.dtypes, columns=['Variable types'])
types.groupby('Variable types').size()

Variable types
int64      2
object    10
dtype: int64

In [17]:
objects_list = list(types[types.values == 'object'].index)
objects_list

['manufacturer',
 'condition',
 'cylinders',
 'fuel',
 'title_status',
 'transmission',
 'drive',
 'size',
 'type',
 'paint_color']
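
As an aside, the list of object columns can probably be obtained more directly with `DataFrame.select_dtypes`. This is a small sketch on a made-up toy frame, assuming the categorical columns are stored with the `object` dtype as in the `info()` output above; the name `object_columns` is only illustrative.

```python
import pandas as pd

# Toy frame with the same mix of dtypes as the dataset (values are made up).
toy = pd.DataFrame({
    "year": [2010, 2015],
    "fuel": pd.Series(["gas", "diesel"], dtype="object"),
    "drive": pd.Series(["fwd", "4wd"], dtype="object"),
    "price": [5000, 12000],
})

# select_dtypes keeps only the columns whose dtype matches the request.
object_columns = toy.select_dtypes(include="object").columns.tolist()
print(object_columns)
```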

## Categorical variables

In [18]:
working_df[objects_list].nunique()

manufacturer    42
condition        6
cylinders        8
fuel             5
title_status     6
transmission     3
drive            3
size             4
type            13
paint_color     12
dtype: int64

👉 Looking for uninformative data tagged as 'other'

In [19]:
for column in objects_list:
  # A column is flagged when some, but not all, of its values contain 'other'
  evaluation = list(working_df[column].str.contains('other').value_counts().index)
  if len(evaluation) == 2:
    print(f"{column} may have uninformative data")

cylinders may have uninformative data
fuel may have uninformative data
transmission may have uninformative data
type may have uninformative data


In [20]:
working_df['type'].value_counts()

sedan          99294
SUV            97372
pickup         49183
truck          46112
coupe          22359
hatchback      13960
other          13815
wagon          11982
convertible    10754
van             9985
mini-van        7818
offroad          988
bus              658
Name: type, dtype: int64

In [21]:
working_df['fuel'].value_counts()

gas         383278
diesel       31824
other        15030
hybrid        3740
electric       670
Name: fuel, dtype: int64

In [22]:
working_df['cylinders'].value_counts()

6 cylinders     112997
8 cylinders      96255
4 cylinders      95502
5 cylinders       2878
10 cylinders      1819
other             1450
3 cylinders        511
12 cylinders       127
Name: cylinders, dtype: int64

In [23]:
working_df['transmission'].value_counts()

automatic    389372
manual        33445
other         11725
Name: transmission, dtype: int64

In [24]:
working_df.columns

Index(['year', 'manufacturer', 'condition', 'cylinders', 'fuel',
       'title_status', 'transmission', 'drive', 'size', 'type', 'paint_color',
       'price'],
      dtype='object')

👉 The **'other'** category in the cylinders, transmission, type and fuel columns adds no useful information to the analysis, so it is **dropped**.

## One hot encoding

In [25]:
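def OneHotEncoding_df(df, column):
  # Prefix every value with the column name, e.g. 'fuel_gas'. NaN stays NaN,
  # so rows with a missing value get zeros in all of that column's indicators.
  OHE_df = pd.get_dummies(column+'_'+df[column])
  return OHE_df

for category in objects_list:
  dummies = OneHotEncoding_df(working_df, category)
  print(f'column {category} transformed!')
  # Replace the original categorical column with its integer indicator columns
  working_df = pd.concat([working_df.drop(category, axis=1), dummies.astype(int)], axis=1)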

print(f"Dataset final size: {working_df.shape}")

column manufacturer transformed!
column condition transformed!
column cylinders transformed!
column fuel transformed!
column title_status transformed!
column transmission transformed!
column drive transformed!
column size transformed!
column type transformed!
column paint_color transformed!
Dataset final size: (434542, 104)
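
For comparison, `pd.get_dummies` can encode several columns in a single call through its `columns` argument. The sketch below uses a made-up toy frame and is an alternative under that assumption, not the code that produced the output above. It also illustrates that a row with a missing category gets zeros in every indicator column of that feature.

```python
import numpy as np
import pandas as pd

# Toy data (made up): the third row has no fuel value.
toy = pd.DataFrame({
    "year": [2010, 2015, 2012],
    "fuel": ["gas", "diesel", np.nan],
    "price": [5000, 12000, 8000],
})

# Encode the listed columns in one call; new columns are named '<column>_<value>'
# and are appended after the columns that were not encoded.
encoded = pd.get_dummies(toy, columns=["fuel"], dtype=int)
print(encoded)
```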


In [26]:
working_df.drop(['cylinders_other', 'transmission_other', 'fuel_other', 'type_other'], axis=1, inplace=True)

In [27]:
working_df.shape

(434542, 100)
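
Dropping an `*_other` indicator column does not remove any rows: a listing whose category was 'other' simply ends up with zeros in all remaining indicators of that feature. The following minimal sketch shows this on a made-up one-hot frame. Selecting the columns by the `_other` suffix is an assumption made for illustration and only holds if no meaningful category name ends in `_other`.

```python
import pandas as pd

# Toy one-hot frame for a single feature (values are made up).
toy = pd.DataFrame({
    "fuel_diesel": [0, 1, 0],
    "fuel_gas": [1, 0, 0],
    "fuel_other": [0, 0, 1],
})

# Collect every indicator column whose name ends in '_other' and drop them.
other_columns = [c for c in toy.columns if c.endswith("_other")]
reduced = toy.drop(columns=other_columns)

# The third row was 'other'; after the drop all of its fuel indicators are zero.
row_sums = reduced.sum(axis=1).tolist()
print(row_sums)  # [1, 1, 0]
```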

In [28]:
types = pd.DataFrame(data=working_df.dtypes, columns=['Variable types'])
types.groupby('Variable types').size()

Variable types
int64    100
dtype: int64

# Exporting dataframe

In [None]:
working_df.to_parquet('cars.parquet')