<a href="https://colab.research.google.com/github/honyango/Analog-World-Clock/blob/master/Group3_Data_Cleaning_and_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Data Cleaning and Preprocessing**

## **Context:**

The market for used and refurbished devices has grown significantly over the past decade, as it provides cost-effective alternatives for both consumers and businesses seeking to save money on purchases. By maximizing the longevity of devices through second-hand trade, this market also reduces their environmental impact and aids in recycling and waste reduction.

## **Objective:**
To explore the relationships between device specifications and usage patterns, clean and preprocess the data to ensure consistency and completeness, and prepare it for further analysis to inform strategies in the refurbished device market.


## **Data Description:**

The dataset contains specifications and usage information for used and refurbished devices. The detailed data dictionary is given below.

## **Data Dictionary**

- **device_brand**: Name of manufacturing brand
- **Operating System** (OS): OS on which the device runs
- **screen_size**: Size of the screen in cm
- **4g**: Whether 4G is available or not
- **5g**: Whether 5G is available or not
- **rear_camera_mp**: Resolution of the rear camera in megapixels
- **front_camera_mp**: Resolution of the front camera in megapixels
- **internal_memory**: Amount of internal memory (ROM) in GB
- **ram**: Amount of RAM in GB
- **battery**: Energy capacity of the device battery in mAh
- **weight**: Weight of the device in grams
- **release_year**: Year when the device model was released
- **days_used**: Number of days the used/refurbished device has been used


### **Importing the necessary libraries**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


### **Brief explanation of the use of each library**

**NumPy:**

NumPy is used for numerical computing. It is the fundamental package for array computing with Python.

**Pandas:**

Pandas is used to process data. It provides data structures and data manipulation tools designed for data cleaning and analysis.


**matplotlib.pyplot:**

Matplotlib is a visualization library whose `pyplot` interface is modeled on the plotting commands of MATLAB. We only need that part of the library for plotting, hence the import of `matplotlib.pyplot`.

**Seaborn:**

Seaborn is another visualization library, aimed at statistical graphics such as heat maps. It is built on top of matplotlib and closely integrated with Pandas data structures.

---

### **Read the dataset**

We use the Pandas library to load the dataset from a CSV file. Pandas provides efficient tools for handling structured data, making it easy to analyze and manipulate.

In [None]:
# Read dataset
def read_dataset(path):
  return pd.read_csv(path)

### **Overview of the dataset**

Show the first 10 records of the dataset. How many columns are there?

In [None]:
# Columns or attributes
def columns(df):
  return df.columns
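
The question above asks for the first 10 records and the number of columns. A minimal sketch of both, assuming a small made-up DataFrame (`sample`, with invented brand names and values) in place of the real dataset:

```python
import pandas as pd

# Hypothetical stand-in for the device dataset: 12 rows, 3 columns
sample = pd.DataFrame({
    "device_brand": ["BrandA", "BrandB"] * 6,
    "ram": [2, 4, 4, 6, 8, 3, 4, 6, 8, 12, 4, 2],
    "battery": [3000, 4000, 3500, 4500, 5000, 3100,
                3300, 4200, 4800, 5000, 3900, 2800],
})

first_ten = sample.head(10)      # first 10 records
n_columns = len(sample.columns)  # number of columns

print(first_ten)
print("Number of columns:", n_columns)
```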


What is the dimension of the dataset? Find the dimension of the dataframe.

- The **shape** of the dataset is a **tuple of 2 elements**: the number of rows and the number of columns.

In [None]:
# Shape of the dataset
def shape(df):
  return df.shape

What is the size of the dataset? Find the size of the dataframe.

- The **size** of the dataset is the **total number of elements** in the data, i.e. the product of the number of rows and the number of columns.

In [None]:
# The size of the dataset
def size(df):
  return df.size
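
To make the relationship between shape and size concrete, here is a small sketch on an invented three-row DataFrame (`df_demo` is illustrative, not the real dataset):

```python
import pandas as pd

# Hypothetical toy frame: 3 rows, 2 columns
df_demo = pd.DataFrame({"ram": [2, 4, 8], "battery": [3000, 4000, 5000]})

n_rows, n_cols = df_demo.shape   # shape is a (rows, columns) tuple
total_elements = df_demo.size    # size is rows * columns

print(df_demo.shape)    # (3, 2)
print(total_elements)   # 6
```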

## **Exploratory data analysis**

**Exploratory data analysis** (EDA) is used for analyzing and investigating datasets and summarizing their main characteristics, often employing data visualization methods.

EDA is an important first step in any data analysis.

What are the data types of all the variables in the data set?

**Hint: Use the `info()` function to get all the information about the dataset.**

In [None]:

# Data info/data types
def data_info(df):
  return df.info()

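Write your observations.

**Observations**

- `info()` lists every column with its non-null count and data type, along with the number of rows and the memory usage.

What are the qualitative and quantitative variables?

- **Quantitative variables** are numerical: they can be measured or counted. Examples: `internal_memory`, `battery`.
- **Qualitative variables** are categorical: labels, categories, and other non-numeric types. Examples: `device_brand`, the operating system, `4g`, `5g`.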

What do we mean by missing values? Are there any missing values in the dataframe?

In [None]:
# Any missing values?
def missing_values(df):
  return df.isnull().sum()
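
A missing value is a cell with no recorded observation, which Pandas represents as `NaN` or `None`. The sketch below, on an invented three-row frame, counts them per column in the same way as `missing_values` above:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one gap in every column
demo = pd.DataFrame({
    "ram": [4.0, np.nan, 8.0],
    "battery": [3000.0, 4000.0, np.nan],
    "device_brand": ["BrandA", None, "BrandB"],
})

missing_per_column = demo.isnull().sum()         # count of gaps per column
has_missing = bool(demo.isnull().values.any())   # True if any gap exists

print(missing_per_column)
print("Any missing values?", has_missing)
```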

**Observations:**

-


What do summary statistics of data represent? Find the summary statistics for the numerical variables (Dtype is `int64` or `float64`) in the data.


**Note on missing values:** Prediction models (regression, machine learning, or deep learning) can be used to predict and replace missing values based on other data. Handling missing values is an essential part of the data cleaning process in data science. Understanding why data is missing, and the mechanism by which it is missing, helps in deciding the most appropriate method to handle it.

In [None]:
# Statistics summary

- What is the central tendency of the data?
    - What is the average screen size of the refurbished devices?
    - What is the median battery capacity (in mAh) of the devices?

- How spread out is the data?
    - What's the range (minimum to maximum) of internal memory (ROM) sizes available?
    - What's the standard deviation of device weights?

- Are there any outliers or extreme values?
    - What are the minimum and maximum values for each variable?

- What are the average resolutions for rear and front cameras?

- What's the average number of days these refurbished devices have been used?

**Answers:**

- Average (mean) screen size: 13.713115 cm
- Median battery capacity: 3000 mAh
- Standard deviation of device weight: 88.413228 g

In [None]:
def summary_statistics(df):
  return df.describe()
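
The questions on central tendency and spread map onto a handful of Pandas calls. This is a sketch on five invented rows, so the numbers are illustrative and are not the answers for the real dataset:

```python
import pandas as pd

# Hypothetical values; the 48.0 screen size is deliberately extreme
demo = pd.DataFrame({
    "screen_size": [10.0, 12.0, 14.0, 16.0, 48.0],
    "battery": [2000, 3000, 3000, 4000, 5000],
})

stats = demo.describe()                       # count, mean, std, min, quartiles, max
mean_screen = demo["screen_size"].mean()      # central tendency: mean
median_battery = demo["battery"].median()     # central tendency: median
screen_range = demo["screen_size"].max() - demo["screen_size"].min()  # spread

print(stats)
print(mean_screen, median_battery, screen_range)
```

Comparing the mean with the median (the `50%` row of `describe()`) is a quick first check for extreme values: here the mean screen size is pulled well above the median by the single large value.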

In [None]:
# Identify any unique values for categorical attributes
def unique_values(df):
  return df.nunique()

In [None]:
def value_counts(df):
  return df.value_counts()

In [None]:
date_range = pd.date_range(start='2023-01-01', end='2023-12-31')

In [None]:
def generate_date_range(start_date, end_date):
    date_range = pd.date_range(start=start_date, end=end_date)
    return date_range

## **Data Cleaning**

Data cleaning is a crucial step in the data preprocessing pipeline, with handling missing values being a key component. Here are several techniques to address missing data:

### **1. Handle missing values**

- Deletion: Remove rows with missing values if few.
    - Listwise deletion: Remove entire records (rows) containing any missing values
    - Pairwise deletion: Use available data while ignoring missing data during analysis, particularly useful in statistical calculations like correlation

- Mean/Median/Mode Imputation: Replace missing values with the mean, the median, or the mode (the most frequent value, suitable for categorical variables).
- Forward fill (ffill) and backward fill (bfill) Imputation: These techniques fill a missing value with the previous or the next observation. They are best suited to ordered data such as **time-series** datasets.
- Using Domain Knowledge: In some cases, subject-matter expertise can be used to make educated guesses about missing values.
- Machine Learning Approaches
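
The difference between listwise and pairwise deletion can be sketched as follows. The data is invented, and the sketch relies on `DataFrame.corr()` excluding missing values pair by pair, which is its documented default behaviour:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: only rows 0 and 4 are complete
demo = pd.DataFrame({
    "ram": [2.0, 4.0, np.nan, 8.0, 6.0],
    "battery": [3000.0, np.nan, 4000.0, 5000.0, 4500.0],
    "weight": [150.0, 160.0, 170.0, np.nan, 180.0],
})

# Listwise deletion: drop every row that has any missing value
listwise = demo.dropna()

# Pairwise deletion: each correlation uses the rows where both columns are present
pairwise_corr = demo.corr()
rows_used_ram_battery = len(demo[["ram", "battery"]].dropna())

print(len(listwise))            # rows kept by listwise deletion
print(rows_used_ram_battery)    # rows used for the ram/battery correlation
print(pairwise_corr)
```

Listwise deletion keeps 2 rows here, while the `ram`/`battery` correlation alone can draw on 3, which is why pairwise deletion preserves more information when gaps are scattered across columns.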

### When choosing a method to handle missing values, consider:

- The proportion of missing data (e.g., if >30%, consider removing rows).

- The nature of the missing data (completely at random, at random, or not at random).

-  The potential impact on your analysis or model.

- The distribution of the data (e.g., use median for skewed distributions, mean for normal distributions)
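
The first consideration, the proportion of missing data, is easy to measure. A minimal sketch on invented data, using the 30% figure from the list above as an example threshold rather than a fixed rule:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with different amounts of missing data per column
demo = pd.DataFrame({
    "ram": [4.0, np.nan, 8.0, np.nan],
    "battery": [3000.0, 4000.0, np.nan, 5000.0],
    "weight": [150.0, 160.0, 170.0, 180.0],
})

# The mean of a boolean mask is the fraction of missing entries per column
missing_share = demo.isnull().mean()

# Columns whose missing share exceeds the example 30% threshold
high_missing = missing_share[missing_share > 0.30].index.tolist()

print(missing_share)
print(high_missing)   # ['ram']
```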


In datasets, missing entries might appear as placeholder values such as "0", "NA", "NaN", "NULL", "Not Applicable", or "None".
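
Placeholders like these need to be converted to real `NaN` values before the missing-value tools can see them. The following is a sketch under the assumption that the columns were read as text; the frame `raw` and its contents are invented:

```python
import pandas as pd

# Hypothetical raw data in which gaps were typed as text placeholders
raw = pd.DataFrame({
    "battery": ["3000", "NA", "Not Applicable", "4000"],
    "weight": ["150", "NULL", "None", "0"],
})

placeholders = ["NA", "NaN", "NULL", "Not Applicable", "None"]

# Blank out the placeholder strings, then convert the columns to numbers
cleaned = raw.mask(raw.isin(placeholders)).apply(pd.to_numeric)

# Treat 0 as missing only where zero is physically impossible (e.g. weight)
cleaned["weight"] = cleaned["weight"].mask(cleaned["weight"] == 0)

print(cleaned)
print(cleaned.isnull().sum())
```

A zero is handled separately because it is a legitimate value in many columns; whether it means "missing" is a judgment made per column.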

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# 1. Deletion
# Listwise deletion: Remove rows with any missing values
def listwise_deletion(df):
  return df.dropna()


In [None]:
# 2. Mean/Median/Mode Imputation
# Replace numerical columns with mean/median
def mean_imputation(df):
  # Fill numeric columns with their mean; returns a new DataFrame
  return df.fillna(df.mean(numeric_only=True))

Do the same for the columns `internal_memory` and `battery`.

In [None]:
def median_imputation(df):
  # Fill numeric columns with their median; returns a new DataFrame
  return df.fillna(df.median(numeric_only=True))

In [None]:

def mode_imputation(df):
  # df.mode() returns a DataFrame; its first row holds one mode per column
  return df.fillna(df.mode().iloc[0])

In [None]:
def forward_fill(df):
  # Propagate the last valid observation forward
  return df.ffill()

In [None]:
def backward_fill(df):
  # Fill each gap with the next valid observation
  return df.bfill()
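
Forward and backward fill behave differently at the edges of a series. A short sketch on an invented daily series (the dates and readings are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with gaps, including one at the end
dates = pd.date_range(start="2023-01-01", periods=5)
readings = pd.Series([10.0, np.nan, np.nan, 13.0, np.nan], index=dates)

forward = readings.ffill()    # carry the previous observation forward
backward = readings.bfill()   # pull the next observation backward

print(forward.tolist())    # [10.0, 10.0, 10.0, 13.0, 13.0]
print(backward.tolist())   # last entry stays NaN: nothing follows it
```

Backward fill cannot fill a trailing gap and forward fill cannot fill a leading one, so the two are sometimes chained.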

In [None]:
def domain_knowledge_imputation(df, value):
  # value: a scalar, or a dict mapping column names to fill values
  return df.fillna(value)

### **2. Detect and handle duplicates**

Detecting and handling duplicate data is essential for maintaining the integrity of your dataset.

In [None]:
# Detect duplicate rows
def detect_duplicates(df):
  return df[df.duplicated()]

In [None]:
# View duplicates
def view_duplicates(df):
  return df[df.duplicated(keep=False)]

In [None]:
# Remove duplicates
def remove_duplicates(df):
  # Keep the first occurrence of each row; returns a new DataFrame
  return df.drop_duplicates()
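
The two `duplicated()` calls above return different rows. This sketch, on an invented four-row frame, shows the difference:

```python
import pandas as pd

# Hypothetical toy frame: row 2 repeats row 0
demo = pd.DataFrame({
    "device_brand": ["BrandA", "BrandB", "BrandA", "BrandC"],
    "ram": [4, 6, 4, 8],
})

later_copies = demo[demo.duplicated()]            # repeats only (first copy excluded)
all_copies = demo[demo.duplicated(keep=False)]    # every row that has a twin
deduplicated = demo.drop_duplicates()             # first occurrence of each row

print(later_copies.index.tolist())   # [2]
print(all_copies.index.tolist())     # [0, 2]
print(len(deduplicated))             # 3
```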



### **3. Detect and handle outliers**

**Outliers** are data points that deviate significantly from the rest of the data or from the expected trends within a dataset. These anomalous points can introduce noise, skew statistical measurements, and reduce the accuracy of analytical models.

As a result, identifying and dealing with outliers is crucial for generating trustworthy insights and making data-driven decisions. Outliers can take numerous forms, including extreme values, anomalies, and data-gathering errors.


### **Outlier detection techniques**
Outlier detection techniques vary in their advantages, limitations, and underlying assumptions. Therefore, it is crucial to select the appropriate method based on the specific characteristics of your data, the objectives of your analysis, and the requirements of your project.

- Statistical Method: Use IQR to identify outliers
- Z-Score: Flag values beyond ±3 standard deviations from the mean (assumes normal distribution)
- Visual Method: Use boxplots to visualize outliers.
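
The IQR and Z-score rules do not always agree. The sketch below applies both to ten invented device weights and illustrates one case where they differ:

```python
import pandas as pd

# Hypothetical device weights in grams; 600 is the suspicious value
weights = pd.Series([140, 150, 155, 160, 165, 170, 175, 180, 185, 600])

# IQR rule: flag values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
q1, q3 = weights.quantile(0.25), weights.quantile(0.75)
iqr = q3 - q1
iqr_outliers = weights[(weights < q1 - 1.5 * iqr) | (weights > q3 + 1.5 * iqr)]

# Z-score rule: flag values more than 3 standard deviations from the mean
z_scores = (weights - weights.mean()) / weights.std()
z_outliers = weights[z_scores.abs() > 3]

print(iqr_outliers.tolist())   # [600]
print(z_outliers.tolist())     # []
```

The IQR rule flags 600, but the Z-score rule flags nothing. In a sample this small the extreme value inflates the standard deviation so much that no point can reach a Z-score of 3, which is one reason to check outliers with more than one method.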

In [None]:
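def detect_outliers(df, column):
  # Return the rows that fall outside the 1.5 * IQR fences
  Q1 = df[column].quantile(0.25)
  Q3 = df[column].quantile(0.75)
  IQR = Q3 - Q1
  lower_bound = Q1 - 1.5 * IQR
  upper_bound = Q3 + 1.5 * IQR
  return df[(df[column] < lower_bound) | (df[column] > upper_bound)]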

In [None]:
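# Function to detect outliers using IQR
def iqr(data, column):
  q1 = data[column].quantile(0.25)
  q3 = data[column].quantile(0.75)
  iqr = q3 - q1
  # Boolean mask: True where the value lies outside the 1.5 * IQR fences
  return (data[column] < q1 - 1.5 * iqr) | (data[column] > q3 + 1.5 * iqr)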

In [None]:
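def remove_outliers(df, column):
  Q1, Q3 = df[column].quantile([0.25, 0.75])
  IQR = Q3 - Q1
  lower_bound = Q1 - 1.5 * IQR
  upper_bound = Q3 + 1.5 * IQR
  return df[~((df[column] < lower_bound) | (df[column] > upper_bound))]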

In [None]:
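# Function to detect outliers using zscore
def zscore(data, column):
  mean = data[column].mean()
  std = data[column].std()
  z = (data[column] - mean) / std
  # Return the rows more than 3 standard deviations from the mean
  return data[z.abs() > 3]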

## **Preprocessing**
**Data preprocessing** is a critical step in data analytics and machine learning, involving the transformation of raw, unstructured, or incomplete data into a clean, usable format. This ensures that data is accurate, consistent, and ready for analysis or model training.

### **1. Data Normalization**
This is a fundamental preprocessing technique in data analytics and machine learning. It involves adjusting the scales of numerical features to ensure that each contributes equally to the analysis, preventing features with larger ranges from dominating the model.

1.	**Min-Max Scaling (Rescaling)**: This method transforms features to fit within a specified range, typically [0, 1]. It's particularly useful when the data distribution is uniform and does not contain outliers.

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$

2.	**Z-Score Normalization (Standardization)**: This technique centers the data around zero with a standard deviation of one. It changes the scale of the data but not the shape of its distribution. It's suitable when the data is roughly normally distributed, and it is less sensitive to outliers than Min-Max scaling.

$$x' = \frac{x - \mu}{\sigma}$$

3.	**Log Scaling**: This approach applies the natural logarithm to the data, which must be strictly positive. It can be beneficial when dealing with data that spans several orders of magnitude or exhibits exponential growth, and it's particularly effective for data that follows a power law distribution.

$$x' = \log(x)$$
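
The three formulas above translate directly into Pandas and NumPy expressions. A minimal sketch on four invented battery capacities, using the population standard deviation (`ddof=0`) for the Z-score as an assumption of this example:

```python
import numpy as np
import pandas as pd

# Hypothetical battery capacities in mAh
battery = pd.Series([1000.0, 2000.0, 3000.0, 5000.0])

# Min-Max scaling: x' = (x - min) / (max - min)
min_max = (battery - battery.min()) / (battery.max() - battery.min())

# Z-score standardization: x' = (x - mu) / sigma
z_score = (battery - battery.mean()) / battery.std(ddof=0)

# Log scaling: x' = log(x), valid only for strictly positive values
log_scaled = np.log(battery)

print(min_max.tolist())   # [0.0, 0.25, 0.5, 1.0]
print(z_score.round(3).tolist())
print(log_scaled.round(3).tolist())
```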

In [None]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


In [None]:
def min_max_scaling(data, column):
  # Scale the column to [0, 1] with scikit-learn; modifies data in place
  scaler = MinMaxScaler()
  data[column] = scaler.fit_transform(data[[column]])
  return data

In [None]:
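# Apply Min-Max normalization
def min_max_normalization(data, column):
  min_value = data[column].min()
  max_value = data[column].max()
  # Rescale to [0, 1]; assumes max_value differs from min_value
  data[column] = (data[column] - min_value) / (max_value - min_value)
  return data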

### **2. Encode categorical variables**

Encoding categorical variables is a crucial preprocessing step in data analysis and machine learning. It involves transforming non-numeric data into a numerical format that algorithms can interpret effectively.

1. **Label Encoding**: This is suitable for **ordinal data** where the categories have a meaningful order. However, use caution with nominal data, as the numerical codes may imply an unintended ordinal relationship.

2. **One-Hot Encoding**: This is preferred for **nominal data** without an inherent order, as it prevents the introduction of ordinal relationships. Be mindful that it can increase the dimensionality of the dataset, especially with features containing many unique categories.
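
The two encodings can be sketched with Pandas alone. The `condition` column and its ordering are hypothetical (they are not part of the data dictionary) and serve only to illustrate an ordinal variable:

```python
import pandas as pd

# Hypothetical toy frame: "condition" is ordinal, "device_brand" is nominal
demo = pd.DataFrame({
    "condition": ["fair", "good", "excellent", "good"],
    "device_brand": ["BrandA", "BrandB", "BrandA", "BrandC"],
})

# Label encoding with an explicit order: fair < good < excellent
ordered = pd.Categorical(
    demo["condition"], categories=["fair", "good", "excellent"], ordered=True
)
condition_codes = ordered.codes.tolist()

# One-hot encoding: one indicator column per brand, no implied order
brand_dummies = pd.get_dummies(demo["device_brand"], prefix="device_brand")

print(condition_codes)               # [0, 1, 2, 1]
print(brand_dummies.columns.tolist())
```

Stating the category order explicitly matters for ordinal data: an encoder that assigns codes alphabetically would place "excellent" before "fair".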

In [None]:
# One-Hot encoding
def one_hot_encoding(data, column):
  encoded_data = pd.get_dummies(data[column], prefix=column)
  # Append the indicator columns and return the widened DataFrame
  data = pd.concat([data, encoded_data], axis=1)
  return data


In [None]:
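from sklearn.preprocessing import LabelEncoder

def label_encoding(data, column):
  # Map each category to an integer code (assigned in sorted order)
  label_encoder = LabelEncoder()
  data[column] = label_encoder.fit_transform(data[column])
  return data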