# Data Analysis on Mobile Phone Features
With technology becoming cheaper and more accessible, multiple new faces emerged into the mobile phone market - each offering a unique set of features at affordable prices.

Given a price range, today's customers have too wide a variety of selections to choose from. The options are endless, and choosing the **best** product becomes unnecessarily complicated.

To solve this problem, I present this analysis, which will help potential customers simplify their selection process by **classifying the best phones** into different categories as follows:
- **The Daily Driver** -> For people who want reliable phones for daily use
- **The Cameraman** -> For casual photographers and videographers who use mobile phones as part of their workflow
- **The Performer** -> For gamers and others who require performance over everything else
- **The Monk** -> For people who want something simple, yet robust, without all the *smart* stuff

Based on their requirements, customers can now narrow their options down to the best ones.

## Explaining the data
The dataset used to perform this analysis contains data of 1000 different mobile phones - in a similar price range - along with all their features.

The dataset used in this analysis [can be found here](https://www.kaggle.com/iabhishekofficial/mobile-price-classification)

To further understand the data and perform exploratory analysis, we have to examine the dataset. To do this, we use `pandas`, a third-party library for the Python programming language. The `pandas` library contains useful functions for representing, analysing and visualising data.

For more information on the `pandas` library, [refer to this link](https://pandas.pydata.org/)

In [1]:
import pandas as pd
dataframe = pd.read_csv('dataset.csv')

We use Python's built-in `import` statement to import the `pandas` library into our project. The `as` keyword creates the alias `pd`, which is used to call the functions of `pandas`

We use the function `pd.read_csv()` to read the data present in `dataset.csv` and store it in a variable called `dataframe`

This variable is used to represent the entire dataset and to perform further analysis
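
As a minimal sketch of what `pd.read_csv()` does, the snippet below parses a few rows and columns copied from the dataset. The in-memory `io.StringIO` buffer and the names `sample_csv` and `sample` are stand-ins introduced here for illustration; the actual analysis reads `dataset.csv` from disk.

```python
import io

import pandas as pd

# A tiny stand-in for dataset.csv (illustrative subset of columns and rows)
sample_csv = io.StringIO(
    "id,battery_power,blue,clock_speed\n"
    "1,1043,1,1.8\n"
    "2,841,1,0.5\n"
    "3,1807,1,2.8\n"
)

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(sample_csv)

print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # column types inferred from the text
```

The first line of the text is treated as the header row, and each column's type is inferred from its values.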

In [2]:
dataframe

,id,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,...,pc,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi
0,1,1043,1,1.8,1,14,0,5,0.1,193,...,16,226,1412,3476,12,7,2,0,1,0
1,2,841,1,0.5,1,4,1,61,0.8,191,...,12,746,857,3895,6,0,7,1,0,0
2,3,1807,1,2.8,0,1,0,27,0.9,186,...,4,1270,1366,2396,17,10,10,0,1,1
3,4,1546,0,0.5,1,18,1,25,0.5,96,...,20,295,1752,3893,10,0,7,1,1,0
4,5,1434,0,1.4,0,11,1,49,0.5,108,...,18,749,810,1773,15,8,7,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,996,1700,1,1.9,0,0,1,54,0.5,170,...,17,644,913,2121,14,8,15,1,1,0
996,997,609,0,1.8,1,0,0,13,0.9,186,...,2,1152,1632,1933,8,1,19,0,1,1
997,998,1185,0,1.4,0,1,1,8,0.5,80,...,12,477,825,1223,5,0,14,1,0,0
998,999,1533,1,0.5,1,0,0,50,0.4,171,...,12,38,832,2509,15,11,6,0,1,0


Evaluating `dataframe` on its own displays a condensed view of our data

Now that we've represented the data, it is time to explain it. The dataset contains 1000 rows and 21 columns. The rows represent the different mobile phones while the columns represent the different features offered by each.
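
The row and column counts can be read directly from the `.shape` attribute. The sketch below uses a small hypothetical frame named `toy` instead of the real dataset, so its numbers differ from the 1000 x 21 quoted above.

```python
import pandas as pd

# Hypothetical three-phone frame standing in for the real dataset
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "battery_power": [1043, 841, 1807],
    "ram": [3476, 3895, 2396],
})

n_rows, n_cols = toy.shape  # shape is a (rows, columns) tuple
print(f"{n_rows} rows (phones), {n_cols} columns (features)")
```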

As there are too many columns, we cannot see the entire list. Hence, we need to print out all the different columns available in our dataset so that we may look into all the features

In [3]:
dataframe.columns

Index(['id', 'battery_power', 'blue', 'clock_speed', 'dual_sim', 'fc',
       'four_g', 'int_memory', 'm_dep', 'mobile_wt', 'n_cores', 'pc',
       'px_height', 'px_width', 'ram', 'sc_h', 'sc_w', 'talk_time', 'three_g',
       'touch_screen', 'wifi'],
      dtype='object')

By using `.columns` we get a list of all the columns available in the dataset. Using this list, we can identify the features of each phone.

### The Features of each phone (column headers)

- `battery_power` -> The battery capacity in mAh (higher values indicate better battery life)
- `blue` -> Whether the phone supports Bluetooth (The value `1` indicates that it supports Bluetooth)
- `clock_speed` -> How fast the processor computes tasks (higher value indicates better performance)
- `dual_sim` -> Whether the phone supports dual sim (The value `1` indicates that it supports dual sim)
- `fc` -> The resolution of the front camera in megapixels (higher value indicates better quality selfies)
- `four_g` -> Whether the phone supports 4G (The value `1` indicates that it supports 4G)
- `int_memory` -> The internal memory of the phone (higher value indicates more storage)
- `m_dep` -> Indicates the depth of the phone
- `mobile_wt` -> Indicates the weight of the phone
- `n_cores` -> Represents the number of cores (higher value indicates better efficiency)
- `pc` -> The resolution of the primary camera in megapixels (higher value indicates better photos and videos)
- `px_height` -> The pixel height of the phone
- `px_width` -> The pixel width of the phone
- `ram` -> The amount of temporary memory available (higher value indicates better multi tasking)
- `sc_h` -> The screen height
- `sc_w` -> The screen width
- `talk_time` -> The longest time the battery will last after a full charge (higher value indicates more battery life)
- `three_g` -> Whether the phone supports 3G (The value `1` indicates that it supports 3G)
- `touch_screen` -> Whether the phone has a touch screen (The value `1` indicates that it has a touch screen)
- `wifi` -> Whether the phone supports Wi-Fi (The value `1` indicates that it supports Wi-Fi)
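
Several of these features are described as flags that take the value `1` or `0`. A quick way to confirm that, sketched here on a hypothetical frame `toy` holding the first five rows of three flag columns, is to test each column with `isin()`. The list name `flag_columns` is introduced only for this illustration.

```python
import pandas as pd

# First five rows of three flag columns, copied from the dataset
toy = pd.DataFrame({
    "blue": [1, 1, 1, 0, 0],
    "dual_sim": [1, 1, 0, 1, 0],
    "wifi": [0, 0, 1, 0, 1],
})

flag_columns = ["blue", "dual_sim", "wifi"]

# True for a column only if every value in it is 0 or 1
is_binary = toy[flag_columns].isin([0, 1]).all()
print(is_binary)
```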

## Data Cleanup

Now that we've understood the data, it is time to do a cleanup. The cleanup process involves:
1. Checking for missing or inconsistent data
2. Removing/modifying irrelevant columns
3. Removing indexes that do not satisfy certain conditions

### 1) Checking for missing or inconsistent data

In any given dataset, there is a probability of missing or inconsistent data

**Missing data** pertains to any data that should have been present in the dataset, but is not. Depending on the dataset, there are several solutions to this problem:
- Remove the entire row containing the missing data
- Replace the missing data with the average value of its column
- Replace the missing data with the average value of its successor and predecessor
- Replace the missing data with a random value in a particular range
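
The four strategies above can be sketched with `pandas` as follows. The series `ram` and its values are hypothetical, and the random fill uses a seeded `numpy` generator as one possible choice. Note that `interpolate()` equals the average of the two neighbours only when a single value is missing between two known ones.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing reading
ram = pd.Series([1000.0, np.nan, 3000.0, 4000.0], name="ram")

dropped = ram.dropna()                # remove the row entirely
mean_filled = ram.fillna(ram.mean())  # replace with the column average
neighbour_filled = ram.interpolate()  # linear: midpoint of the two neighbours

# Random value within the observed range (seeded so the run is repeatable)
rng = np.random.default_rng(seed=0)
random_filled = ram.fillna(rng.uniform(ram.min(), ram.max()))

print(neighbour_filled.tolist())
```

Which strategy is appropriate depends on the dataset; dropping rows is the simplest but discards the other values in those rows.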

**Inconsistent data** occurs when similar data is kept in different formats across different files. Care should be taken to maintain data integrity when dealing with multiple files.

To detect missing values in a dataset we use `.isna()`, which returns `True` if data is missing

In [4]:
dataframe.isna()

,id,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,m_dep,mobile_wt,...,pc,px_height,px_width,ram,sc_h,sc_w,talk_time,three_g,touch_screen,wifi
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
996,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
997,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
998,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


From the operation performed, it _seems like_ there are no missing values since `dataframe.isna()` returned `False`

But this output is a condensed version of the actual output. This result doesn't guarantee the presence of all values.

To dig deeper, we use the `any()` operation, which returns `True` if at least one value along an axis is `True`. Calling it twice in conjunction first reduces each column to a single boolean and then reduces those booleans to one value, so both axes are covered

In [5]:
dataframe.isna().any().any()

False

**Since the output of this operation is `False`, we can conclude that there are no missing values in the given dataset.** If instead, the above operation returned `True`, then we would have to clean up the dataset to fix the missing values
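
To see what a `True` result would look like, here is a minimal sketch on a hypothetical frame `toy` with one value deliberately set to missing. It also shows `isna().sum()`, which counts the missing values per column and helps locate them.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a single missing value in 'ram'
toy = pd.DataFrame({
    "battery_power": [1043, 841, 1807],
    "ram": [3476, np.nan, 2396],
})

mask = toy.isna()               # same shape as toy, True where a value is missing
per_column = mask.any()         # one boolean per column
has_missing = per_column.any()  # a single boolean for the whole frame
missing_counts = mask.sum()     # number of missing values in each column

print(per_column)
print(has_missing)
print(missing_counts)
```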

**And since all data is stored in a single file, it is consistent**

### 2) Removing/modifying irrelevant columns

Given a dataset, it is important to look into every attribute to check whether it provides useful information. Chances are that a dataset contains irrelevant attributes or columns that do not give us any insights. Removing and/or modifying these attributes will result in a more compact and cleaner dataset.

So once again, we look at the columns of our dataset

In [6]:
dataframe.columns

Index(['id', 'battery_power', 'blue', 'clock_speed', 'dual_sim', 'fc',
       'four_g', 'int_memory', 'm_dep', 'mobile_wt', 'n_cores', 'pc',
       'px_height', 'px_width', 'ram', 'sc_h', 'sc_w', 'talk_time', 'three_g',
       'touch_screen', 'wifi'],
      dtype='object')

Observing these attributes, we find the `three_g` attribute to be obsolete. 3G cannot meet the high speed demands of today. Hence, we remove the `three_g` attribute because it is irrelevant

In [7]:
del dataframe['three_g']
dataframe.columns

Index(['id', 'battery_power', 'blue', 'clock_speed', 'dual_sim', 'fc',
       'four_g', 'int_memory', 'm_dep', 'mobile_wt', 'n_cores', 'pc',
       'px_height', 'px_width', 'ram', 'sc_h', 'sc_w', 'talk_time',
       'touch_screen', 'wifi'],
      dtype='object')

By using the `del` statement and specifying the column name, the `three_g` attribute is now removed. The output of `dataframe.columns` confirms the successful deletion of the attribute

Similarly, the attribute `m_dep` that stands for mobile depth is also removed. Mobile depth is a factor that is almost never considered by a potential customer and thus, is not required in our dataset

In [8]:
del dataframe['m_dep']
dataframe.columns

Index(['id', 'battery_power', 'blue', 'clock_speed', 'dual_sim', 'fc',
       'four_g', 'int_memory', 'mobile_wt', 'n_cores', 'pc', 'px_height',
       'px_width', 'ram', 'sc_h', 'sc_w', 'talk_time', 'touch_screen', 'wifi'],
      dtype='object')

We've successfully removed `m_dep` as indicated above by the `dataframe.columns` command
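
An equivalent way to remove columns, shown here only as an alternative sketch and not as the method used in this notebook, is `DataFrame.drop()`. It can remove several columns in one call and returns a new frame instead of modifying the original. The frame `toy` is hypothetical and holds values from the first two rows of the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [1, 2],
    "m_dep": [0.1, 0.8],
    "three_g": [0, 1],
    "ram": [3476, 3895],
})

# drop() leaves `toy` untouched and returns a copy without the listed columns
trimmed = toy.drop(columns=["three_g", "m_dep"])

print(trimmed.columns.tolist())
```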

#### Merging Attributes together
Observing the dataset once again, the `px_height` and `px_width` attributes can be merged together to create a new attribute `resolution`, which gives us the display resolution of the mobile phone. This information is more useful than the separate height and width

**To create the new `resolution` attribute,** we use the values in `px_height` and `px_width` along with the **screen size of the mobile phone** to find the PPI ([Pixels Per Inch](https://en.wikipedia.org/wiki/Pixel_density)), which is a common metric used for resolution

`PPI = diagonal length (pixels) / diagonal length (inches)`

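Diagonal lengths (both in pixels and inches) can be found by taking the [Euclidean Distance](https://en.wikipedia.org/wiki/Euclidean_distance) of the height and width, that is, the square root of height² + width². Note that this treats `sc_h` and `sc_w` as inches; the dataset's Kaggle description appears to list them in centimetres, in which case the result is pixels per centimetre and would need to be multiplied by 2.54 to obtain PPI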

In [9]:
import numpy as np

# creating four numpy arrays containing the pixel height, pixel width, screen height and screen width respectively
pixelHeight = dataframe['px_height'].to_numpy()
pixelWidth = dataframe['px_width'].to_numpy()
screenHeight = dataframe['sc_h'].to_numpy()
screenWidth = dataframe['sc_w'].to_numpy()

# calculating the diagonal values in pixels and inches
diagPix = pow(pixelHeight*pixelHeight + pixelWidth*pixelWidth,1/2)
diagInch = pow(screenHeight*screenHeight + screenWidth*screenWidth,1/2)

# calculating PPI using the formula given above
ppi = diagPix/diagInch

# adding the PPI values to the new attribute 'resolution'
dataframe['resolution'] = np.floor(ppi).tolist()

dataframe

,id,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,mobile_wt,n_cores,pc,px_height,px_width,ram,sc_h,sc_w,talk_time,touch_screen,wifi,resolution
0,1,1043,1,1.8,1,14,0,5,193,3,16,226,1412,3476,12,7,2,1,0,102.0
1,2,841,1,0.5,1,4,1,61,191,5,12,746,857,3895,6,0,7,0,0,189.0
2,3,1807,1,2.8,0,1,0,27,186,3,4,1270,1366,2396,17,10,10,1,1,94.0
3,4,1546,0,0.5,1,18,1,25,96,8,20,295,1752,3893,10,0,7,1,0,177.0
4,5,1434,0,1.4,0,11,1,49,108,6,18,749,810,1773,15,8,7,0,1,64.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,996,1700,1,1.9,0,0,1,54,170,7,17,644,913,2121,14,8,15,1,0,69.0
996,997,609,0,1.8,1,0,0,13,186,4,2,1152,1632,1933,8,1,19,1,1,247.0
997,998,1185,0,1.4,0,1,1,8,80,1,12,477,825,1223,5,0,14,0,0,190.0
998,999,1533,1,0.5,1,0,0,50,171,2,12,38,832,2509,15,11,6,1,0,44.0


Finding PPI (Pixels Per Inch) requires certain calculations as per the formula. These element-wise calculations are convenient to express with `numpy`, a Python library that is specifically used for numerical computations and which also integrates seamlessly with `pandas`

[Click here for more information on numpy](https://numpy.org/)

To use the `numpy` library, we use the command `import numpy as np`, where `np` is the alias used to access the different `numpy` methods.

Four `numpy` arrays are created from the attributes `px_height`, `px_width`, `sc_h` and `sc_w`, and are stored in `pixelHeight`, `pixelWidth`, `screenHeight` and `screenWidth` respectively

Using these arrays, we can perform the Euclidean operation to find the diagonal lengths in both pixels (`diagPix`) and inches (`diagInch`). These diagonal lengths are also `numpy` arrays

Once these values are calculated, PPI is simply the quotient of `diagPix` over `diagInch`. A `numpy` floor operation - `np.floor()` - eliminates the decimals before adding the values to the new attribute `resolution`
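
The same calculation can be written more compactly with `np.hypot()`, which computes the Euclidean length sqrt(a² + b²) element-wise. This is a sketch on three rows copied from the dataset, not the code used in the notebook; the frame `toy` and the helper name `pixel_density` are hypothetical.

```python
import numpy as np
import pandas as pd

# Rows 0, 2 and 4 of the dataset, restricted to the four columns needed
toy = pd.DataFrame({
    "px_height": [226, 1270, 749],
    "px_width": [1412, 1366, 810],
    "sc_h": [12, 17, 15],
    "sc_w": [7, 10, 8],
})


def pixel_density(frame):
    """Diagonal length in pixels divided by the diagonal length of the screen."""
    diag_pixels = np.hypot(frame["px_height"], frame["px_width"])
    diag_screen = np.hypot(frame["sc_h"], frame["sc_w"])
    return np.floor(diag_pixels / diag_screen)


toy["resolution"] = pixel_density(toy)
print(toy["resolution"].tolist())  # [102.0, 94.0, 64.0]
```

The three results match the `resolution` values shown for the same rows in the output above.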

The next step is removing the attributes `px_height` and `px_width`, which are now redundant

In [10]:
del dataframe['px_height'], dataframe['px_width']
dataframe.columns

Index(['id', 'battery_power', 'blue', 'clock_speed', 'dual_sim', 'fc',
       'four_g', 'int_memory', 'mobile_wt', 'n_cores', 'pc', 'ram', 'sc_h',
       'sc_w', 'talk_time', 'touch_screen', 'wifi', 'resolution'],
      dtype='object')

### 3) Removing indexes that do not satisfy certain conditions

It is now time to look at the indexes of the dataset, namely the rows. Our dataset contains 1000 elements (mobile phones along with their features) but not all of them are satisfactory. That is, some mobile phones in the dataset do not satisfy the basic requirements of a customer.

For example: **WiFi, Mobile Data and Bluetooth.**

In the modern world, these three features are expected to be in every phone. Customers do not even consider them because it is a given that they are present in every phone. **Hence, those phones not having these features are removed.**

In [11]:
dataframe[['blue','four_g','wifi']]

,blue,four_g,wifi
0,1,0,0
1,1,1,0
2,1,0,1
3,0,1,0
4,0,1,1
...,...,...,...
995,1,1,0
996,0,0,1
997,0,1,0
998,1,0,0


By examining the `blue`, `four_g` and `wifi` attributes, we find that the presence or absence of each feature is represented by the values `1` and `0` respectively. **Now, we can select the elements having the value `1` and assign them to a new dataframe**

In [12]:
# removing the elements that do not have Bluetooth
nD1 = dataframe[dataframe.blue !=0]

# removing the elements that do not have WiFi
nD2 = nD1[nD1.wifi !=0]

# removing the elements that do not support Mobile Data
dataframe = nD2[nD2.four_g !=0]

dataframe

,id,battery_power,blue,clock_speed,dual_sim,fc,four_g,int_memory,mobile_wt,n_cores,pc,ram,sc_h,sc_w,talk_time,touch_screen,wifi,resolution
5,6,1464,1,2.9,1,5,1,50,198,8,9,3506,10,7,3,1,1,89.0
15,16,1846,1,1.0,0,5,1,53,106,8,7,563,9,5,10,0,1,178.0
18,19,1231,1,1.7,1,2,1,37,194,2,3,3902,19,12,15,0,1,78.0
56,57,1272,1,0.5,0,9,1,54,133,8,11,3181,10,6,14,0,1,223.0
63,64,1634,1,2.3,1,2,1,39,164,1,7,2167,12,0,20,1,1,61.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
956,957,937,1,1.5,0,2,1,44,133,4,6,493,9,7,15,0,1,175.0
960,961,710,1,2.3,1,1,1,24,153,2,3,3678,18,4,18,1,1,138.0
967,968,636,1,2.6,1,7,1,55,183,2,14,3558,10,7,10,1,1,144.0
993,994,567,1,2.7,1,14,1,56,165,8,17,336,7,6,7,1,1,152.0


By using intermediate dataframes - `nD1` and `nD2` - we formulate `dataframe` once again. This time, `dataframe` contains all the required features. Namely: Bluetooth, WiFi and Mobile Data
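
The same filter can also be expressed in a single step by combining the three conditions with the element-wise `&` operator, which avoids the intermediate dataframes. The sketch below is an alternative for illustration; the frame `toy` holds the phones with `id` 1 to 6 from the dataset, and the names `has_all` and `connected` are hypothetical.

```python
import pandas as pd

# Phones with id 1-6, restricted to the three connectivity flags
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "blue": [1, 1, 1, 0, 0, 1],
    "four_g": [0, 1, 0, 1, 1, 1],
    "wifi": [0, 0, 1, 0, 1, 1],
})

# Combine the three conditions element-wise; each needs its own parentheses
has_all = (toy["blue"] == 1) & (toy["four_g"] == 1) & (toy["wifi"] == 1)

# .copy() makes the result independent of `toy`
connected = toy[has_all].copy()
print(connected["id"].tolist())  # [6]
```

Only the phone with `id` 6 has all three features, which agrees with the first row of the filtered output above.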

This brings the dataset down to just 126 elements from 1000 (an 87% reduction). **Having a smaller dataset without losing relevant information comes with a few advantages:**
- Faster computation
- Takes less memory
- Helps in making better decisions
- Easy to analyse

The intermediate dataframes `nD1` and `nD2` are deleted because they've finished serving their purpose. By deleting them, we can save memory

In [13]:
del nD1, nD2

#### Removing newly irrelevant columns

By looking at the `blue`, `wifi` and `four_g` columns of the newly formulated `dataframe`, we find that all values of those three attributes are `1` (this is in line with the operation we performed above)

As such, these columns are redundant - all the elements (mobile phones) in the dataset contain these three features, as denoted by the value `1` - and are removed

In [14]:
del dataframe['blue'], dataframe['wifi'], dataframe['four_g']
dataframe.columns

Index(['id', 'battery_power', 'clock_speed', 'dual_sim', 'fc', 'int_memory',
       'mobile_wt', 'n_cores', 'pc', 'ram', 'sc_h', 'sc_w', 'talk_time',
       'touch_screen', 'resolution'],
      dtype='object')

Using the `.columns` command we confirm that the required columns are deleted
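
Columns that hold a single repeated value can also be detected programmatically with `nunique()`, instead of by inspection. This is a minimal sketch on a hypothetical frame `toy` that mimics the state after filtering; the names `distinct_counts` and `constant_columns` are introduced for illustration.

```python
import pandas as pd

# Hypothetical frame after filtering: the flag columns hold only the value 1
toy = pd.DataFrame({
    "id": [6, 16, 19],
    "blue": [1, 1, 1],
    "wifi": [1, 1, 1],
    "ram": [3506, 563, 3902],
})

# A column with a single distinct value carries no information
distinct_counts = toy.nunique()
constant_columns = distinct_counts[distinct_counts == 1].index.tolist()
print(constant_columns)  # ['blue', 'wifi']
```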

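The final step is to reindex the dataframe so that the row labels run consecutively from `0` again. Passing `drop=True` discards the old labels, and `inplace=True` modifies `dataframe` directly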

In [15]:
dataframe.reset_index(drop=True, inplace=True)
dataframe

,id,battery_power,clock_speed,dual_sim,fc,int_memory,mobile_wt,n_cores,pc,ram,sc_h,sc_w,talk_time,touch_screen,resolution
0,6,1464,2.9,1,5,50,198,8,9,3506,10,7,3,1,89.0
1,16,1846,1.0,0,5,53,106,8,7,563,9,5,10,0,178.0
2,19,1231,1.7,1,2,37,194,2,3,3902,19,12,15,0,78.0
3,57,1272,0.5,0,9,54,133,8,11,3181,10,6,14,0,223.0
4,64,1634,2.3,1,2,39,164,1,7,2167,12,0,20,1,61.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
121,957,937,1.5,0,2,44,133,4,6,493,9,7,15,0,175.0
122,961,710,2.3,1,1,24,153,2,3,3678,18,4,18,1,138.0
123,968,636,2.6,1,7,55,183,2,14,3558,10,7,10,1,144.0
124,994,567,2.7,1,14,56,165,8,17,336,7,6,7,1,152.0


This is the dataframe after the cleanup process. The original dataframe of 1000 x 21 elements is reduced to a dataframe containing 126 x 15 elements by:
1. Checking for missing and inconsistent data
2. Removing and modifying irrelevant columns
3. Removing indexes that do not satisfy basic requirements

We use this new dataframe to perform further analysis on the dataset
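
The effect of the reindexing step used in the cleanup can be seen on a small hypothetical frame `toy`, whose index labels (5, 15, 18) mimic the gaps left behind by filtering. This is a sketch for illustration only.

```python
import pandas as pd

# Hypothetical filtered frame whose index still carries the original row labels
toy = pd.DataFrame(
    {"id": [6, 16, 19], "ram": [3506, 563, 3902]},
    index=[5, 15, 18],
)

# drop=True discards the old labels instead of keeping them as a new column
toy = toy.reset_index(drop=True)

print(toy.index.tolist())  # [0, 1, 2]
```

Without `drop=True`, the old labels would be kept as an extra column named `index`.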

## Exploratory Analysis