#  Work on Data preparation, visualization and Machine learning with Python
## Data Science in Production

<img src='https://www.uniquindio.edu.co/info/uniquindio/media/bloque2.png' width="90" height="110" >

**Students:** <br>

Humberto Franco Osorio - Juan Felipe Padilla


**Email:** hfranoo@uqvirtual.edu.co - jfpadillag@uqvirtual.edu.co


**Teacher:** [Jose R. Zapata](https://joserzapata.github.io) <br>
<br>
<a href='https://joserzapata.github.io'>
    <img src='https://1000marcas.net/wp-content/uploads/2020/02/logo-GitHub.png' width="90" height="50" >
</a>
&nbsp;
<a href='https://twitter.com/joserzapata'>
    <img src='https://upload.wikimedia.org/wikipedia/commons/thumb/4/4f/Twitter-logo.svg/1200px-Twitter-logo.svg.png' width="50" height="40" >
</a>
&nbsp;&nbsp;&nbsp;&nbsp;
<a href='https://www.linkedin.com/in/jose-ricardo-zapata-gonzalez/'>
    <img src='https://cdn-icons-png.flaticon.com/512/174/174857.png' width="50" height="50" >
</a>


## Table of Contents
1) Import Libraries
2) Load Dataset
3) Data description and cleaning

    - Variable identification
    - Missing and Duplicate values
        - Identify missing values
        - Identify duplicates
        - Dataset cleaning
    - Elimination of features that do not provide information
4) Save intermediate transformed data
5) Partial Results

### Introduction
Analyze the relationship between the characteristics of a cell phone and its selling price, with the aim of classifying the price range in which a cell phone is found.

4 different outputs: <br>
&nbsp;&nbsp;&nbsp; **0** -> Low  <br>
&nbsp;&nbsp;&nbsp; **1** -> Medium   <br>
&nbsp;&nbsp;&nbsp; **2** -> High   <br>
&nbsp;&nbsp;&nbsp; **3** -> Very High   <br>

### Main Objective
> What is the price range of a mobile phone according to its features?

## 1. Import Libraries

In [1]:
import pandas as pd
pd.options.plotting.backend = "plotly"
import numpy as np
import re

## 2. Load Dataset

In [2]:
ds = pd.read_csv("../data/raw/CellPhoneDS.csv")

## 3. Data description and cleaning

### Variable identification

<center>

_**Table 1.** List of features and their descriptions in the initial dataset._

|     | Feature name  | Description and values |  Type| % missing |
|:---:|     :---:     | :---: |:---: |:---: | 
| 1 | Unnamed: 0    | Unique identifier of registered mobile phone. | Numeric |  0
| 2 | index         | Unique mobile phone model identifier. | Nominal | 0
| 3 | battery_power | Total energy a battery can store in one time measured in mAh.     | Numeric | 3.77
| 4 | blue          | Has bluetooth or not. <br>0: NO. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;  1: YES.| Nominal| 4.02
| 5 | clock_speed   | Speed at which microprocessor executes instructions.<br> 31 distinct values. | Numeric| 3.67
| 6 | dual_sim      | Has dual sim support or not.<br>0: NO. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;  1: YES. | Nominal | 3.54
| 7 | fc            | Front Camera mega pixels. | Numeric| 4.00
| 8 | four_g        | Has 4G or not. <br>0: NO. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;  1: YES.| Nominal | 3.96
| 9 | int_memory    | Internal Memory in Gigabytes. | Numeric | 3.86
| 10 | m_dep         | Mobile Depth in cm. | Numeric  | 3.79
| 11 | mobile_wt     | Weight of mobile phone. | Numeric | 3.66
| 12 | n_cores       | Number of cores of processor. | Numeric| 3.71
| 13 | pc            | Primary Camera mega pixels. | Numeric| 3.54
| 14 | px_height     | Pixel Resolution Height. | Numeric | 3.86
| 15 | px_width      | Pixel Resolution Width | Numeric | 3.77
| 16 | ram           | Random Access Memory in Megabytes.| Numeric  | 4.02
| 17 | sc_h          | Screen Height of mobile in cm.| Numeric  | 3.87
| 18 | sc_w          | Screen Width of mobile in cm. | Numeric  | 3.58
| 19 | talk_time     | Longest talk time that a single battery charge will last.  | Numeric | 3.74
| 20 | three_g       | Has 3G or not. <br>0: NO. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;  1: YES.| Nominal | 3.64
| 21 | touch_screen  | Has touch screen or not. <br>0: NO. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;  1: YES.| Nominal | 3.56
| 22 | wifi          | Has wifi or not. <br>0: NO. &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;  1: YES.| Nominal | 3.64
| 23 | hgv           | <font color='yellow'>Feature with no information, no information can be collected for this feature.</font> | Numeric | 3.74
| 24 | werf          | <font color='yellow'>Feature with no information, no information can be collected for this feature.</font> | Nominal | 0
| 25 | price_range   | Price range is given by categories.<br> &nbsp;&nbsp;&nbsp; **0** -> Low  <br>&nbsp;&nbsp;&nbsp; **1** -> Medium   <br>&nbsp;&nbsp;&nbsp; **2** -> High   <br>&nbsp;&nbsp;&nbsp; **3** -> Very High   <br> | Ordinal | 4.58

</center>

The initial working dataset has 25 columns (features) and 74790 registered mobile phone records.

In [3]:
ds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74790 entries, 0 to 74789
Data columns (total 25 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Unnamed: 0     74790 non-null  int64 
 1   index          74790 non-null  int64 
 2   talk_time      71990 non-null  object
 3   battery_power  71970 non-null  object
 4   pc             72140 non-null  object
 5   three_g        72070 non-null  object
 6   mobile_wt      72050 non-null  object
 7   px_width       71970 non-null  object
 8   sc_h           71895 non-null  object
 9   sc_w           72115 non-null  object
 10  m_dep          71955 non-null  object
 11  touch_screen   72130 non-null  object
 12  fc             71800 non-null  object
 13  four_g         71825 non-null  object
 14  hgv            71990 non-null  object
 15  price_range    71365 non-null  object
 16  blue           71785 non-null  object
 17  n_cores        72015 non-null  object
 18  wifi           72070 non-n

### Missing and Duplicate values

#### A) Identify missing values

In [4]:
ds.sample(10).T

Unnamed: 0,20376,63331,26166,22050,39673,18913,38332,23960,12456,28821
Unnamed: 0,20376.0,63331.0,26166.0,22050.0,39673,18913,38332.0,23960.0,12456,28821.0
index,948.0,465.0,2460.0,130.0,60,667,643.0,61.0,269,183.0
talk_time,5285988458456.0,9.0,17.0,7.0,20.0,8.0,2.0,6.0,6.0,6.0
battery_power,1631.0,1583.0,,-948961565145.0,1484.0,nhbgvfrtd 56gyub,920.0,799.0,1281.0,959.0
pc,16.0,10.0,4.0,-948961565145.0,5.0,7.0,3.0,6.0,6.0,19.0
three_g,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0
mobile_wt,166.0,118.0,,-948961565145.0,nhbgvfrtd 56gyub,nhbgvfrtd 56gyub,149.0,144.0,148.0,84.0
px_width,1735.0,862.0,,829.0,969.0,nhbgvfrtd 56gyub,1421.0,975.0,1617.0,1631.0
sc_h,12.0,14.0,12.0,17.0,nhbgvfrtd 56gyub,12.0,6.0,15.0,12.0,16.0
sc_w,3.0,10.0,9.0,7.0,4.0,7.0,0.0,,1.0,1.0


Values such as '??????', 'nhbgvfrtd 56gyub', and '-948961565145.0' are observed, as well as values that are clearly atypical given the description of the feature. It is therefore important to find out which values each column contains.

Get the unique values per column.

In [5]:
print("-----------------------------------------------------------------------------------")
for col in ds:
    print(f"{col} = {ds[col].unique()}")
    print("Number option",f"{len(ds[col].unique())}")
    print("-----------------------------------------------------------------------------------")

-----------------------------------------------------------------------------------
Unnamed: 0 = [    0     1     2 ... 74787 74788 74789]
Number option 74790
-----------------------------------------------------------------------------------
index = [1517  911 1493 ... 1192 1495 1181]
Number option 2493
-----------------------------------------------------------------------------------
talk_time = ['11.0' '9.0' '3.0' '-948961565145.0' '6.0' 'nhbgvfrtd 56gyub' '12.0'
 '8.0' '16.0' '20.0' '13.0' '5285988458456.0' '14.0' '7.0' nan '19.0'
 '10.0' '15.0' '4.0' '18.0' '??????' '17.0' '5.0' '2.0']
Number option 24
-----------------------------------------------------------------------------------
battery_power = ['911.0' '1284.0' '1183.0' ... '527.0' '1542.0' '846.0']
Number option 1099
-----------------------------------------------------------------------------------
pc = ['4.0' '14.0' 'nhbgvfrtd 56gyub' '9.0' '11.0' '20.0' '1.0' '??????' '0.0'
 '16.0' '7.0' '5.0' '5285988458456.0' '10.0' 

A previous analysis of the dataset showed that most of the features are numerical, so missing values are searched for by looking for non-numerical values.

In [6]:
atipic_value = []
for col in ds:
    # Only object (string) columns can contain non-numeric tokens
    if ds[col].dtype == object:
        for value in ds[col]:
            # Flag values with letters, '?', '/' or '-' (NaN is flagged too: str(nan) == 'nan')
            if re.search(r'[a-zA-Z?/-]', str(value)):
                atipic_value.append(value)
missing_value = np.unique(np.array(atipic_value))
print(f'Meaningless values in dataset: {str(missing_value)[1:-1]}')

Meaningless values in dataset: '-948961565145.0' '??????' 'nan' 'nhbgvfrtd 56gyub'
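
The same search can also be expressed with pandas' own numeric parser instead of a regular expression. The following is a minimal sketch under stated assumptions: the `toy` frame and the helper `non_numeric_tokens` are hypothetical and are not part of the notebook. Note that a numeric-looking placeholder such as '-948961565145.0' parses successfully, so this approach only catches text tokens and would need a separate range check for such values.

```python
import pandas as pd

# Toy frame that mimics the raw string columns (illustrative values only)
toy = pd.DataFrame({
    "talk_time": ["11.0", "??????", "9.0", None],
    "pc": ["4.0", "nhbgvfrtd 56gyub", "-948961565145.0", "14.0"],
})

def non_numeric_tokens(frame: pd.DataFrame) -> set:
    """Return the distinct non-null values that cannot be parsed as numbers."""
    tokens = set()
    for col in frame.columns:
        parsed = pd.to_numeric(frame[col], errors="coerce")
        # Present in the raw column but NaN after parsing -> not a number
        bad = frame[col][parsed.isna() & frame[col].notna()]
        tokens.update(bad.unique())
    return tokens

print(non_numeric_tokens(toy))
```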


Next, we analyze how often these values occur in each feature of the dataset.

In [7]:
print("-------------------------------------------------------------")
for value in missing_value:
    print("Values equal to ", value)
    # Count, per column, the cells equal to this meaningless value
    frec = ds[ds == value].count()
    for col in ds:
        if frec[col] > 0:
            print(f"{col} = {frec[col]}")
print("-------------------------------------------------------------")

-------------------------------------------------------------
Values equal to  -948961565145.0
talk_time = 2495
battery_power = 2495
pc = 2495
three_g = 2495
mobile_wt = 2495
px_width = 2495
sc_h = 2495
sc_w = 2495
m_dep = 2495
touch_screen = 2495
fc = 2495
four_g = 2495
hgv = 2495
price_range = 2495
blue = 2495
n_cores = 2495
wifi = 2495
dual_sim = 2495
ram = 2495
int_memory = 2495
px_height = 2495
clock_speed = 2495
Values equal to  ??????
talk_time = 2495
battery_power = 2495
pc = 2495
three_g = 2495
mobile_wt = 2495
px_width = 2495
sc_h = 2495
sc_w = 2495
m_dep = 2495
touch_screen = 2495
fc = 2495
four_g = 2495
hgv = 2495
price_range = 2495
blue = 2495
n_cores = 2495
wifi = 2495
dual_sim = 2495
ram = 2495
int_memory = 2495
px_height = 2495
clock_speed = 2495
Values equal to  nan
Values equal to  nhbgvfrtd 56gyub
talk_time = 2495
battery_power = 2495
pc = 2495
three_g = 2495
mobile_wt = 2495
px_width = 2495
sc_h = 2495
sc_w = 2495
m_dep = 2495
touch_screen = 2495
fc = 2495
f

Each of these meaningless values appears in only 2495 of the 74790 records of each affected feature (about 3.3%), so it is decided to replace them with NaN.

In [8]:
for atipic in missing_value:
    ds.replace(atipic, np.nan, inplace=True)
    
ds.isnull().sum()*100/len(ds)


Unnamed: 0        0.000000
index             0.000000
talk_time        13.751838
battery_power    13.778580
pc               13.551277
three_g          13.644872
mobile_wt        13.671614
px_width         13.778580
sc_h             13.878861
sc_w             13.584704
m_dep            13.798636
touch_screen     13.564648
fc               14.005883
four_g           13.972456
hgv              13.751838
price_range      14.587512
blue             14.025939
n_cores          13.718412
wifi             13.644872
dual_sim         13.544592
ram              14.025939
werf              0.000000
int_memory       13.865490
px_height        13.865490
clock_speed      13.678299
dtype: float64

After the replacement, the percentage of null values in each feature increases by approximately 10 percentage points (three meaningless values, each present in about 3.3% of the records), giving an average of close to 14% missing values per feature. This is considered a low value given the size of the dataset, so it is decided to eliminate the records that contain null values.
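
The size of this increase can be checked directly from the counts printed earlier. The short calculation below is only a sanity check; the variable names are illustrative.

```python
n_rows = 74790     # records in the raw dataset
per_value = 2495   # occurrences of each meaningless value per feature
n_values = 3       # '-948961565145.0', '??????', 'nhbgvfrtd 56gyub'

# Percentage points of missing data added to each affected feature
added_points = 100 * n_values * per_value / n_rows
print(round(added_points, 2))  # 10.01
```

For example, battery_power goes from 3.77% to about 13.78% missing, which matches the output above.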

In [9]:
# Plot percentages (not raw counts) so that the values match the title
(ds.isnull().sum()*100/len(ds)).plot.bar(title = 'Percentage of Missing Values per column',color=ds.columns)

#### B) Identify duplicates

Knowing that '*_index_*' is a unique indicator of the mobile phone model, it is assumed that mobile phones with the same model number will have the same characteristics.

With this in mind, we now check whether the model number of the cell phone is registered more than once.

In [10]:
print( 'Duplicated Registers: ' + str( ds["index"].duplicated().sum() ))
print( 'Unique Registers: ' + str( len ( ds["index"].unique() ) ) )

Duplicated Registers: 72297
Unique Registers: 2493


There are 2493 unique values in the '*_index_*' column, which indicates that the dataset has only 2493 mobile phone models and 72297 duplicate records. Therefore these 72297 duplicate records will be eliminated.
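
The two figures are linked: every row beyond the first occurrence of a model counts as a duplicate, so the number of duplicates equals the number of rows minus the number of unique models. A small sketch on a toy frame (the data is made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"index": [10, 11, 10, 12, 11, 10]})

n_rows = len(toy)
n_unique = toy["index"].nunique()
n_duplicated = int(toy["index"].duplicated().sum())

# Rows beyond the first occurrence of each model are the duplicates
print(n_rows, n_unique, n_duplicated)  # 6 3 3

# Same identity with the figures reported in this notebook
print(74790 - 2493)  # 72297
```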

#### C) Dataset cleaning
Taking into account the high number of duplicate records, it is decided to eliminate all records with NaN values and then to check whether the number of unique records (2493) is maintained once these records have been eliminated.

In [11]:
ds_copy = ds.copy() #Save copy of original dataframe
ds_copy.dropna(inplace=True,axis=0)
ds_copy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23790 entries, 0 to 74782
Data columns (total 25 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Unnamed: 0     23790 non-null  int64 
 1   index          23790 non-null  int64 
 2   talk_time      23790 non-null  object
 3   battery_power  23790 non-null  object
 4   pc             23790 non-null  object
 5   three_g        23790 non-null  object
 6   mobile_wt      23790 non-null  object
 7   px_width       23790 non-null  object
 8   sc_h           23790 non-null  object
 9   sc_w           23790 non-null  object
 10  m_dep          23790 non-null  object
 11  touch_screen   23790 non-null  object
 12  fc             23790 non-null  object
 13  four_g         23790 non-null  object
 14  hgv            23790 non-null  object
 15  price_range    23790 non-null  object
 16  blue           23790 non-null  object
 17  n_cores        23790 non-null  object
 18  wifi           23790 non-n

By eliminating the records with NaN values we were able to reduce the size of the data set to 23790 records.

In [12]:
ds_copy.sample(10).T

Unnamed: 0,74525,12996,37475,30936,61269,17884,36189,40078,41985,6323
Unnamed: 0,74525.0,12996.0,37475.0,30936.0,61269.0,17884.0,36189.0,40078.0,41985.0,6323.0
index,1658.0,487.0,553.0,556.0,795.0,1143.0,2195.0,2453.0,2237.0,2413.0
talk_time,11.0,18.0,18.0,12.0,7.0,19.0,5.0,15.0,15.0,9.0
battery_power,1812.0,1663.0,1544.0,1552.0,1442.0,904.0,928.0,683.0,821.0,1611.0
pc,15.0,14.0,20.0,18.0,3.0,11.0,13.0,8.0,9.0,14.0
three_g,1.0,5285988458456.0,1.0,1.0,1.0,1.0,5285988458456.0,0.0,1.0,1.0
mobile_wt,162.0,169.0,113.0,180.0,145.0,112.0,80.0,197.0,109.0,98.0
px_width,1550.0,1439.0,857.0,658.0,1668.0,1014.0,1243.0,1135.0,1786.0,714.0
sc_h,18.0,7.0,8.0,13.0,11.0,13.0,19.0,9.0,8.0,5285988458456.0
sc_w,5285988458456.0,1.0,7.0,6.0,8.0,3.0,10.0,0.0,4.0,4.0


The value *5285988458456.0* is observed in several fields of the dataset. It is a meaningless value that the earlier search did not detect because it contains only digits and a decimal point, so it is decided to remove the records that contain it in any of their features.

In [13]:
ds_copy.replace('5285988458456.0',np.nan,inplace=True)
ds_copy.dropna(inplace=True,axis=0)
ds_copy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 12130 entries, 0 to 74782
Data columns (total 25 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Unnamed: 0     12130 non-null  int64 
 1   index          12130 non-null  int64 
 2   talk_time      12130 non-null  object
 3   battery_power  12130 non-null  object
 4   pc             12130 non-null  object
 5   three_g        12130 non-null  object
 6   mobile_wt      12130 non-null  object
 7   px_width       12130 non-null  object
 8   sc_h           12130 non-null  object
 9   sc_w           12130 non-null  object
 10  m_dep          12130 non-null  object
 11  touch_screen   12130 non-null  object
 12  fc             12130 non-null  object
 13  four_g         12130 non-null  object
 14  hgv            12130 non-null  object
 15  price_range    12130 non-null  object
 16  blue           12130 non-null  object
 17  n_cores        12130 non-null  object
 18  wifi           12130 non-n

By eliminating the records with **5285988458456.0** values we were able to reduce the size of the data set to 12130 records.
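
Numeric-looking placeholders like this one can also be caught with a range check instead of being listed one by one. The sketch below is an illustration on toy data; the bound `MAX_PLAUSIBLE` is a hypothetical threshold that assumes no legitimate feature value reaches one million, which would have to be confirmed against the feature descriptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "three_g": ["1.0", "5285988458456.0", "0.0"],
    "sc_h": ["18.0", "7.0", "-948961565145.0"],
})

# Hypothetical bound (assumption): no real feature value is this large
MAX_PLAUSIBLE = 1_000_000

# Parse every column as numbers; unparseable text becomes NaN
numeric = toy.apply(pd.to_numeric, errors="coerce")

# A row is suspicious if any of its values is implausibly large in magnitude
rows_to_drop = (numeric.abs() >= MAX_PLAUSIBLE).any(axis=1)
print(rows_to_drop.tolist())  # [False, True, True]

clean = toy[~rows_to_drop]
```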


Now, we proceed to delete the duplicate records in the dataset.

In [14]:
ds_copy.drop_duplicates(subset ="index", keep = 'first', inplace = True)
ds_copy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2328 entries, 0 to 14950
Data columns (total 25 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Unnamed: 0     2328 non-null   int64 
 1   index          2328 non-null   int64 
 2   talk_time      2328 non-null   object
 3   battery_power  2328 non-null   object
 4   pc             2328 non-null   object
 5   three_g        2328 non-null   object
 6   mobile_wt      2328 non-null   object
 7   px_width       2328 non-null   object
 8   sc_h           2328 non-null   object
 9   sc_w           2328 non-null   object
 10  m_dep          2328 non-null   object
 11  touch_screen   2328 non-null   object
 12  fc             2328 non-null   object
 13  four_g         2328 non-null   object
 14  hgv            2328 non-null   object
 15  price_range    2328 non-null   object
 16  blue           2328 non-null   object
 17  n_cores        2328 non-null   object
 18  wifi           2328 non-nul

In [15]:
print( 'Duplicated Registers: ' + str( ds_copy["index"].duplicated().sum() ))
print( 'Unique Registers: ' + str( len ( ds_copy["index"].unique() ) ) )
print(f"Total missing values in the dataset: {ds_copy.isnull().sum().sum()}")

Duplicated Registers: 0
Unique Registers: 2328
Total missing values in the dataset: 0
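
Only 2328 of the original 2493 models remain, so 165 models had no complete record. If records sharing the same _index_ really describe the same phone, as assumed earlier, an alternative is to combine them and keep the first non-null value of each feature per model. The following is a sketch of that alternative on toy data, not the method used in this notebook:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "index": [1, 1, 2, 2],
    "ram": [np.nan, 2048.0, 512.0, np.nan],
    "pc": [8.0, np.nan, np.nan, 5.0],
})

# Approach used in the notebook: drop incomplete rows, then deduplicate
strict = toy.dropna().drop_duplicates(subset="index", keep="first")

# Alternative: first non-null value of each feature within each model
merged = toy.groupby("index", as_index=False).first()

print(len(strict), len(merged))  # 0 2
```

In this toy case no single row is complete, so the strict approach loses both models while the grouped approach recovers them.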


### Elimination of features that do not provide information

The columns _Unnamed: 0_ and _index_ are eliminated since they are only numerical identifiers for the cell phone models consulted by Pedro Perez.

The _werf_ column is eliminated since it only has a single value in all the records of the dataset.

In [16]:
ds_copy.drop(['Unnamed: 0', 'index', 'werf'], axis=1, inplace=True)
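
Single-valued columns such as _werf_ can also be found programmatically rather than by inspection. A minimal sketch on a toy frame (the values are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "ram": [512, 2048, 1024],
    "werf": ["a", "a", "a"],  # illustrative constant column
})

# A column with exactly one distinct value carries no information
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]
print(constant_cols)  # ['werf']

reduced = toy.drop(columns=constant_cols)
```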


## 4. Save intermediate transformed data

Data without:
- Duplicates
- Missing values
- Non-informative data for our task

In [17]:
ds_copy.to_parquet('../data/interim/CellPhoneDs_clean.parquet')


## 5. Partial Results

### Initial raw data:
74790 rows, 25 columns, memory usage: 10.9 MB

### After cleaning process
2328 rows, 22 columns, memory usage: 110.0 KB
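
To put these figures in perspective, the share of rows and of distinct phone models that survive the cleaning can be computed from the counts reported in this notebook. The variable names are illustrative.

```python
raw_rows, clean_rows = 74790, 2328
raw_models = 2493  # unique values of 'index' in the raw data

rows_kept_pct = 100 * clean_rows / raw_rows
models_kept_pct = 100 * clean_rows / raw_models

print(f"rows kept: {rows_kept_pct:.2f}%")      # rows kept: 3.11%
print(f"models kept: {models_kept_pct:.2f}%")  # models kept: 93.38%
```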

***
