# Data Analysis with Python
---
### Table of contents
[Chapter 1: Importing Data Sets](#chapter-1-importing-data-sets)    
[Chapter 2: Data Wrangling](#chapter-2-data-wrangling)  
[Chapter 3: Exploratory Data Analysis](#chapter-3-exploratory-data-analysis)    
[Chapter 4: Model Development](#chapter-4-model-development)    
[Chapter 5: Model Evaluation and Refinement](#chapter-5-model-evaluation-and-refinement)    
[Final Assignment](#final-assignment)   

## Chapter 1: Importing Data Sets
### Python Packages for Data Science
1. **Scientific Computing** Libraries
- **Pandas** (Data Structures & Tools)
- **NumPy** (Arrays & Matrices)
- **SciPy** (Integrals, solving differential equations, optimisations)
2. **Visualisation** Libraries
- **Matplotlib** (plots & graphs, most popular)
- **Seaborn** (plots: heat maps, time series, violin plots)
3. **Algorithmic** Libraries
- **Scikit-learn** (Machine Learning: Regression, classification, and so on)
- **Statsmodels** (Explore data, estimate statistical models, and perform statistical tests)

### Importing Data
- Process of loading and reading data into Python from various resources
- Two important properties:
    - Format: .csv, .json, .xlsx, .hdf,...
    - File Path of dataset:
        - Computer: */Desktop/mydata.csv*
        - Internet: *https://archive.ics.uci.edu/autos/imports-85.data*
#### Importing and Exporting a CSV in Python
```py
import pandas as pd
url = 'https://archive.ics.uci.edu/autos/imports-85.data'
# Import
df = pd.read_csv(url, header = None) # Dataset without the header
# Printing the DataFrame in Python
df # print the entire DataFrame
df.head(n) # Show the first "n" rows of DataFrame
df.tail(n) # Show the bottom "n" rows of DataFrame
# Adding headers
## Replace default header (by df.columns = headers)
headers = ["symboling", "normalised-losses", "make",...]
df.columns = headers
# Export
path = r"C:\Windows\...\ automobile.csv" # raw string so the backslashes are not read as escapes
df.to_csv(path)
```
|Data Format|Read|Save|
|-----------|----|----|
|csv|pd.read_csv()|df.to_csv()|
|json|pd.read_json()|df.to_json()|
|Excel|pd.read_excel()|df.to_excel()|
|sql|pd.read_sql()|df.to_sql()|
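
The read/save pairing in the table can be tried without a network connection. The following is a minimal sketch, assuming a small made-up, header-less data set held in an in-memory text buffer (`raw`, `df_demo`, and `df_back` are hypothetical names, not part of the course data):

```python
import io
import pandas as pd

# Hypothetical header-less rows standing in for a downloaded file
raw = "3,?,alfa-romero\n1,?,alfa-romero\n2,164,audi\n"

# header=None tells pandas that the first line is data, not column names
df_demo = pd.read_csv(io.StringIO(raw), header=None)
df_demo.columns = ["symboling", "normalised-losses", "make"]

# index=False keeps the row index out of the exported text
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# The export now starts with a header row, so the default header handling applies
df_back = pd.read_csv(io.StringIO(csv_text))
```

Swapping the `io.StringIO` objects for file paths or URLs gives the same round trip on real files.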

### Getting Started Analysing Data in Python
- Basic insights from the data
    1. Understand your data before you begin any analysis
    2. Should check:
        - Data Types
        - Data Distributions
    3. Locate potential issues with the data
#### Data Types
|Pandas Type|Native Python Type|Description|
|-|-|-|
|object|string|text, or a mix of numbers and strings|
|int64|int|whole numbers|
|float64|float|numbers with decimals|
|datetime64, timedelta[ns]|N/A (but see the *datetime* module in Python's standard library)|time data|

Why check data types?
- potential information and type mismatches
- compatibility with Python methods

We use `dataframe.dtypes` to check data types
```py
df.dtypes
# Check a statistical summary
df.describe()
# Check full summary statistics (including object type attributes)
df.describe(include = 'all')
# Check a concise summary of DataFrame
df.info()
```
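
To see what these calls report, here is a small sketch on made-up rows (`df_demo` and its values are illustrative assumptions). The "price" column deliberately contains a "?" so that pandas cannot infer a numeric type for it:

```python
import pandas as pd

# Made-up rows; "price" holds "?" so pandas keeps it as text
df_demo = pd.DataFrame({
    "make": ["audi", "bmw", "volvo"],
    "doors": [4, 2, 4],
    "length": [176.6, 176.8, 188.8],
    "price": ["13950", "16430", "?"],
})

print(df_demo.dtypes)

# By default describe() only covers the numeric columns
numeric_summary = df_demo.describe()
# include="all" also summarises non-numeric columns (count, unique, top, freq)
full_summary = df_demo.describe(include="all")
```

A text type on a column that should be numeric, as with "price" here, is exactly the kind of mismatch this check is meant to surface.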

### Accessing Databases with Python
![image.png](attachment:image.png)
![image-2.png](attachment:image-2.png)
Concepts of the Python DB API
1. Connection Objects
    - Database connections
    - Manage transactions
2. Cursor Objects
    - Database Queries
What are Connection methods?
- cursor()
- commit()
- rollback()
- close()

Writing code using DB-API
```py
from dmodule import connect

# Create connection object
connection = connect('databasename', 'username', 'pswd')
# Create a cursor object to run queries and fetch results
cursor = connection.cursor()

# Run queries
cursor.execute('select * from mytable')
results = cursor.fetchall()

# Free resources to avoid unused connections
cursor.close()
connection.close()
```
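
The `dmodule` import above is a placeholder for whichever database driver is in use. As one concrete, runnable instance, the sketch below uses `sqlite3` from Python's standard library, which follows the DB-API, with a throwaway in-memory database. The table contents are made up for illustration:

```python
import sqlite3

# Connection object: ":memory:" creates a temporary in-memory database
connection = sqlite3.connect(":memory:")
# Cursor object: runs queries and fetches results
cursor = connection.cursor()

cursor.execute("CREATE TABLE mytable (make TEXT, price INTEGER)")
cursor.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [("audi", 13950), ("bmw", 16430)],
)
connection.commit()  # make the two inserts permanent

cursor.execute("INSERT INTO mytable VALUES ('volvo', 22625)")
connection.rollback()  # undo the uncommitted insert

cursor.execute("SELECT * FROM mytable ORDER BY price")
results = cursor.fetchall()  # list of row tuples

# Free resources
cursor.close()
connection.close()
```

Only the two committed rows come back, which shows how `commit()` and `rollback()` manage transactions on the connection.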
### Lab: Importing Data Sets - Used Cars Pricing
[🔗 Open Lab: Importing Data Sets - Used Cars Pricing](lab_import_datasets_used_cars_pricing.ipynb)

### Overview: Laptop Pricing Data Set
[🔗 Open Overview: Laptop Pricing Data Set](overview_laptop_pricing_data_sets.pdf)

### Lab: Importing Data Sets - Laptop Pricing
[🔗 Open Lab: Importing Data Sets - Laptop Pricing](lab_importing_dataset_laptop_pricing.ipynb)

### Module 1 Cheat Sheet: Importing Data Sets
[🔗 Open Module 1 Cheat Sheet: Importing Data Sets](module1_cheatsheet.pdf)

## Chapter 2: Data Wrangling
### Pre-processing Data in Python
**Also known as:** Data Cleaning, Data Wrangling
- Accessing columns: `df['symboling']`, `df['body-style']`
- Add 1 to every value in a column: `df['symboling'] = df['symboling'] + 1`

### Dealing with missing values in Python
- Missing values occur when no data value is stored for a variable (feature) in an observation.
- Could be represented as "?", "N/A", 0 or just a blank cell.

#### How to deal with missing values
1. **Check with the data collection source**
2. **Drop the missing values**
    - drop the variable
    - drop the data entry
3. **Replace the missing value**
    - replace it with an average (if similar datapoints)
    - replace it by frequency
    - replace it based on other functions
4. **Leave it as missing data**

#### How to drop missing values in Python
Use `dataframe.dropna()` (axis=0 drops the entire row; axis=1 drops the entire column)

Example: 
```py
df.dropna(subset=['price'], axis=0, inplace=True)
df = df.dropna(subset=['price'], axis=0) # equivalent code
```
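
The difference between the two axes is easiest to see on a tiny table. This is a sketch on made-up data (`df_demo` is a hypothetical frame with one missing price):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "make": ["audi", "bmw", "volvo"],
    "price": [13950.0, np.nan, 22625.0],
})

# axis=0: drop every row whose 'price' is missing
rows_kept = df_demo.dropna(subset=["price"], axis=0)

# axis=1: drop every column that contains at least one missing value
cols_kept = df_demo.dropna(axis=1)
```

Without `inplace=True`, `dropna()` returns a new DataFrame and leaves `df_demo` unchanged.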

#### How to replace missing values in Python
Use `dataframe.replace(missing_value, new_value)`

Example: replace by the mean value
```py
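import numpy as np
mean = df['normalised-losses'].mean()
# replace() returns a new Series, so assign the result back to the column
df['normalised-losses'] = df['normalised-losses'].replace(np.nan, mean)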
```
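
When missing values are stored as a placeholder such as "?", the column has to be made numeric before a mean can be computed. A minimal sketch, assuming a made-up three-value column (`losses` is a hypothetical name):

```python
import numpy as np
import pandas as pd

losses = pd.Series(["164", "?", "158"], name="normalised-losses")

# Turn the "?" placeholder into NaN, then convert the strings to numbers
losses = pd.to_numeric(losses.replace("?", np.nan))

# mean() skips NaN by default: (164 + 158) / 2 = 161
mean = losses.mean()
losses = losses.replace(np.nan, mean)
```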

### Data Formatting
- Data is usually collected from different places and stored in different formats
- Bringing data into a common standard of expression allows users to make meaningful comparison

|Non-formatted|Formatted|
|-|-|
|- confusing <br> - hard to aggregate <br> - hard to compare| - more clear <br> - easy to aggregate <br> - easy to compare|

Example:

|City (as collected)|City (formatted)|
|-|-|
|N.Y.|New York|
|Ny|New York|
|NY|New York|
|New York|New York|

#### Applying calculations to an entire column
Convert 'mpg' to 'L/100km' in the car dataset:

|city-mpg|city-L/100km|
|-|-|
|21|11.2|
|21|11.2|
|19|12.4|
|...|...|
```py
df['city-mpg']= 235/df['city-mpg']

df.rename(columns={'city-mpg':'city-L/100km'}, inplace=True)
```
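
As a quick check of the conversion on the three sample rows in the table, here is a small sketch (`df_demo` is a hypothetical frame holding only those rows):

```python
import pandas as pd

df_demo = pd.DataFrame({"city-mpg": [21, 21, 19]})

# L/100km = 235 / mpg, applied to the whole column at once
df_demo["city-mpg"] = 235 / df_demo["city-mpg"]
df_demo.rename(columns={"city-mpg": "city-L/100km"}, inplace=True)

# Round to one decimal place for comparison with the table
rounded = df_demo["city-L/100km"].round(1).tolist()
```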
#### Incorrect data types
Sometimes the wrong data type is assigned to a feature.
```py
df['price'].tail(5)
```
![image.png](attachment:image.png)


#### Correcting data types
- To *identify* data types: use the `dataframe.dtypes` attribute
- To *convert* data types: use the `dataframe.astype()` method

Example: Convert data type to integer in column 'price'
```py
df['price']=df['price'].astype('int')
```
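
A short sketch of why the conversion matters, using made-up price strings (`price` is a hypothetical Series): arithmetic such as `sum()` only gives a numeric total once the values are integers.

```python
import pandas as pd

# Prices read from a text file often arrive as strings
price = pd.Series(["13950", "16430", "22625"], name="price")
was_numeric = pd.api.types.is_numeric_dtype(price)

# Convert the strings to integers
price = price.astype("int")
is_integer = pd.api.types.is_integer_dtype(price)
total = price.sum()
```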

### Data Normalisation
- Bring feature values with different ranges onto a uniform scale
![image-2.png](attachment:image-2.png)

|Not-normalised|Normalised|
|-|-|
|- 'age' and 'income' are in different ranges <br> - hard to compare <br> - 'income' will influence the result more| - similar value range <br> - similar intrinsic influence on analytical model|

#### Methods of normalising data
Several approaches for normalisation:

1. **Simple feature scaling**: new values range between 0 and 1 (for non-negative data).
$$x_{new} = \frac{x_{old}}{x_{max}}$$
2. **Min-Max**: new values range between 0 and 1.
$$x_{new}= \frac{x_{old} - x_{min}}{x_{max} - x_{min}}$$
3. **Z-score** (or **standard score**): new values hover around 0 and typically range between -3 and +3, but can be higher or lower.
$$x_{new}= \frac{x_{old} - \mu}{\sigma}$$

##### Min-Max Scaling in Python
With Pandas:
![image-3.png](attachment:image-3.png)

```py
df['length'] =(df['length'] - df['length'].min())/(df['length'].max() - df['length'].min())
```

##### Z-score
With Pandas:
![image-4.png](attachment:image-4.png)

```py
df['length'] = (df['length'] - df['length'].mean())/df['length'].std()
```
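
The three formulas can be compared side by side on a tiny made-up column. This is a sketch, assuming three illustrative lengths chosen so the results are easy to verify by hand:

```python
import pandas as pd

length = pd.Series([150.0, 175.0, 200.0])

# Simple feature scaling: divide by the maximum
simple = length / length.max()

# Min-max: shift by the minimum, divide by the range
min_max = (length - length.min()) / (length.max() - length.min())

# Z-score: subtract the mean, divide by the standard deviation
# (pandas std() is the sample standard deviation, ddof=1)
z_score = (length - length.mean()) / length.std()
```

Here `simple` is 0.75, 0.875, 1.0; `min_max` is 0, 0.5, 1; and `z_score` is -1, 0, 1, since the mean is 175 and the sample standard deviation is 25.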

### Binning
- **Binning**: Grouping of values into "bins"
- Convert numeric into categorical variables
- Group a set of numerical values into a set of "bins"
- Example: "price" is a feature ranging from 5,000 to 45,500
![image-5.png](attachment:image-5.png)
![image-6.png](attachment:image-6.png)

```py
import numpy as np
import pandas as pd

# 3 bins need 4 equally spaced dividers
bins = np.linspace(min(df['price']), max(df['price']), 4)
# group_names holds the label for each bin
group_names = ['Low', 'Medium', 'High']
df['price-binned'] = pd.cut(df['price'], bins, labels=group_names, include_lowest=True)
```
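To make the bin edges concrete, here is a sketch on five made-up prices spanning the range mentioned above (`price` and `binned` are hypothetical names):

```python
import numpy as np
import pandas as pd

price = pd.Series([5000, 12000, 20000, 31000, 45500])

# 4 equally spaced edges define 3 bins: 5000, 18500, 32000, 45500
bins = np.linspace(price.min(), price.max(), 4)
group_names = ["Low", "Medium", "High"]

# include_lowest=True puts the minimum value into the first bin
binned = pd.cut(price, bins, labels=group_names, include_lowest=True)
counts = binned.value_counts()
```

Each bin is closed on the right, so 45,500 falls into "High"; without `include_lowest=True` the minimum, 5,000, would fall outside the first bin and become missing.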
#### Visualising binned data
- Histograms
![image-7.png](attachment:image-7.png)

### Turning Categorical Variables into Quantitative Variables in Python
#### Categorical variables
**Problem:** Most statistical models cannot take objects/strings as input.

**Solution:**
- Add a dummy variable for each unique category
- Assign 0 or 1 in each category
#### Dummy variables in Python pandas
- Use the `pandas.get_dummies()` function.
- Convert categorical variables to dummy variables (0 or 1)
```py
pd.get_dummies(df['fuel'])
```
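
A small sketch of the output, using a made-up three-row "fuel" column. Recent pandas versions return True/False columns by default, so `dtype=int` is passed here to get the 0/1 values described above:

```python
import pandas as pd

fuel = pd.Series(["gas", "diesel", "gas"], name="fuel")

# One column per unique category; dtype=int gives 0/1 instead of True/False
dummies = pd.get_dummies(fuel, dtype=int)
```

The result has one column per category ("diesel" and "gas"), and each row contains a single 1 marking its category.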

### Lab: Data Wrangling - Used Cars Pricing
[🔗 Open Lab: Data Wrangling - Used Cars Pricing](lab_data_wrangling_used_cars_pricing.ipynb)

### Lab: Data Wrangling - Laptop Pricing
[🔗 Open Lab: Data Wrangling - Laptop Pricing](lab_data_wrangling_laptop_pricing.ipynb)

### Module 2 Cheat Sheet: Data Wrangling
[🔗 Open Module 2 Cheat Sheet: Data Wrangling](module2_cheatsheet.pdf)

## Chapter 3: Exploratory Data Analysis
### Exploratory Data Analysis
- Preliminary step in data analysis to:
    - Summarise main characteristics of the data
    - Gain better understanding of the data set
    - Uncover relationships between variables
    - Extract important variables
- Question:
`"What are the characteristics which have the most impact on the car price?"`

### Descriptive Statistics
- Explore data before building complicated models
- Calculate descriptive statistics for your data
- Describe basic features of data
- Give short summaries of the sample and measures of the data
- Summarise statistics using the pandas **describe()** method
```py
df.describe()
```
- Summarise categorical data using the **value_counts()** method
```py
# value_counts() returns a Series; to_frame() turns it into a one-column DataFrame
drive_wheels_counts = df['drive-wheels'].value_counts().to_frame(name='value_counts')
drive_wheels_counts.index.name = 'drive-wheels'
```
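
A runnable sketch of the same idea on six made-up rows (`df_demo` is a hypothetical frame; the category labels are illustrative):

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"drive-wheels": ["fwd", "rwd", "fwd", "4wd", "fwd", "rwd"]}
)

# Count how often each category occurs, most frequent first
counts = df_demo["drive-wheels"].value_counts()

# Turn the Series into a one-column DataFrame with a descriptive name
counts_frame = counts.to_frame(name="value_counts")
counts_frame.index.name = "drive-wheels"
```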
- Summarise numeric data using **box plots**
![image.png](attachment:image.png)
```py
import seaborn as sns
sns.boxplot(x='drive-wheels', y='price', data=df)
```
- Descriptive statistics with **scatter plots**
    - each observation represented as a point
    - scatter plots show the relationship between two variables:
        1. Predictor/independent variables on x-axis
        2. Target/dependent variables on y-axis
```py
import matplotlib.pyplot as plt

y = df['price']
x = df['engine-size']
plt.scatter(x, y)

plt.title('Scatterplot of Engine Size vs Price')
plt.xlabel('Engine Size')
plt.ylabel('Price')
```
![image-2.png](attachment:image-2.png)
### GroupBy in Python
Question:
`Is there any relationship between the different types of 'drive system' and the 'price' of the vehicles?`

- Use Pandas **dataframe.groupby()** method:
    - Can be applied to categorical variables
    - Group data into categories
    - Single or multiple variables
```py
df_test = df[['drive-wheels', 'body-style', 'price']]
df_grp = df_test.groupby(['drive-wheels', 'body-style'], as_index=False).mean()
```
- Pandas method: **pivot()**
    - One variable is displayed along the columns, and the other variable is displayed along the rows

```py
df_pivot = df_grp.pivot(index='drive-wheels', columns='body-style')
```
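
The grouping and pivoting steps can be followed on a handful of made-up rows. This is a sketch under the assumption of five illustrative cars; unlike the call above, it passes `values="price"` to `pivot()` so that the result has plain column labels:

```python
import pandas as pd

df_demo = pd.DataFrame({
    "drive-wheels": ["fwd", "fwd", "rwd", "rwd", "rwd"],
    "body-style": ["sedan", "sedan", "sedan", "wagon", "wagon"],
    "price": [8000.0, 10000.0, 20000.0, 15000.0, 17000.0],
})

# Mean price per (drive-wheels, body-style) combination
grouped = df_demo.groupby(["drive-wheels", "body-style"], as_index=False).mean()

# One variable along the rows, the other along the columns;
# combinations that never occur in the data become NaN
pivoted = grouped.pivot(index="drive-wheels", columns="body-style", values="price")
```

In this sample there is no front-wheel-drive wagon, so that cell of `pivoted` is NaN.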
- Heatmap
    - Plot target variable against multiple variables
```py
plt.pcolor(df_pivot, cmap="RdBu")
plt.colorbar()
plt.show()
```
### Creating Different Types of Plots in Python
[🔗 Open Creating Different Types of Plots in Python](creating_diff_types_of_plots_inPython.pdf)

### Correlation
What is Correlation?
- Measures to what extent different variables are interdependent
- For example:
    - Lung cancer --> Smoking
    - Rain --> Umbrella
- Correlation doesn't imply causation

Correlation: Positive linear relationship
- Correlation between two features (engine-size and price)
```py
sns.regplot(x='engine-size', y='price', data=df)
plt.ylim(0,)
```
![image-3.png](attachment:image-3.png)
Correlation: Negative linear relationship
- Correlation between two features (highway-mpg and price)
```py
sns.regplot(x='highway-mpg', y='price', data=df)
plt.ylim(0,)
```
![image-4.png](attachment:image-4.png)

### Correlation - Statistics
Pearson Correlation
- Measures the strength of the correlation between two features
    - Correlation coefficient
    - p-value
- Correlation coefficient
    - Close to +1: Large Positive relationship
    - Close to -1: Large Negative relationship
    - Close to 0: No relationship
- p-value
    - p-value < 0.001: strong certainty in the result
    - p-value < 0.05: moderate certainty in the result
    - p-value < 0.1: weak certainty in the result
    - p-value > 0.1: no certainty in the result
```py
from scipy import stats
pearson_coef, p_value = stats.pearsonr(df['horsepower'], df['price'])
```
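
As a self-contained illustration, the sketch below runs the same call on five made-up (horsepower, price) pairs that rise almost in step. The numbers are assumptions chosen for illustration, not values from the car data set. With this sample the coefficient comes out close to +1 and the p-value falls below 0.05, which the scale above reads as a large positive relationship with at least moderate certainty.

```python
from scipy import stats

# Made-up values: price rises almost linearly with horsepower
horsepower = [100, 110, 120, 150, 200]
price = [10500, 10800, 12500, 14800, 20100]

# Returns the correlation coefficient and the two-sided p-value
pearson_coef, p_value = stats.pearsonr(horsepower, price)
```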
Correlation - Heatmap
![image-5.png](attachment:image-5.png)

### Chi-Squared Test for Categorical Variables
[🔗 Open Chi-square Test for Categorical Variables](chisquare_test_for_categorical_variables.pdf)

### Hands-on Lab: Exploratory Data Analysis
[🔗 Open Hands-on Lab: Exploratory Data Analysis - Used Cars Pricing](Exploratory_data_analysis_cars.ipynb)

[🔗 Open Hands-on Lab: Exploratory Data Analysis - Laptop Pricing](Exploratory_data_analysis_laptop.ipynb)

## Chapter 4: Model Development

## Chapter 5: Model Evaluation and Refinement

## Final Assignment