# Exploratory Data Analysis (EDA) and Time Series Analysis

Moving forward, we will use Google Colab to run our code. You can access Google Colab here: https://colab.research.google.com/

# Correlation and Relationship Analysis

Correlation and relationship analysis are important techniques in data science for understanding the association between variables. In this lesson, we will explore the concept of correlation, its types, and how to analyze relationships between variables using Python. We will use practical examples to demonstrate these concepts.

Correlation measures the statistical relationship between two variables. It helps us understand how changes in one variable are associated with changes in another variable. Correlation does not imply causation, but it indicates the strength and direction of the relationship.

Several correlation coefficients can quantify the relationship between variables. This lesson focuses on the most widely used one:

Pearson correlation coefficient (r): It measures the linear relationship between two continuous variables. The value of r ranges from -1 to +1. A positive value indicates a positive linear relationship, a negative value indicates a negative linear relationship, and a value close to zero indicates no linear relationship.
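
To make the definition concrete, here is a minimal sketch that computes r directly from the standard textbook formula: the sum of the products of each variable's deviations from its mean, divided by the square root of the product of the summed squared deviations. The names `dx`, `dy`, and `r_manual` are illustrative choices, not part of any library.

```python
import numpy as np

# Same example data as the lesson's NumPy cell
x = np.array([1, 2, 3, 4, 5])
y = np.array([3, 4, 5, 6, 7])

# Deviations of each value from its variable's mean
dx = x - x.mean()
dy = y - y.mean()

# Pearson r: co-variation of x and y, scaled by the spread of each variable
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print("Pearson r computed by hand:", r_manual)
```

In practice you would rely on a library routine such as `np.corrcoef`, which should agree with this hand computation.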

In [None]:
import numpy as np
import pandas as pd

# Example data
x = np.array([1, 2, 3, 4, 5])
y = np.array([3, 4, 5, 6, 7])

# Calculate Pearson correlation coefficient
correlation_matrix = np.corrcoef(x, y)

# since .corrcoef returns a matrix of the correlations
# for different combinations of variables, we choose the index
# [0, 1] to access the correlation for variable 1 (0th index) 
# and variable 2 (1st index)
pearson_coefficient = correlation_matrix[0, 1]

print("Pearson correlation coefficient:", pearson_coefficient)

Given the above result, how would you describe the relationship between these data points?

<span style = "background-color: yellow">
TODO: Pick two numeric columns from your own dataset that you think might be correlated. Use the correlation calculation above to quantify the relationship between them. In a sentence or two, describe that relationship.
</span>

We can also visualize the correlation with a scatter plot of one variable against the other.

In [None]:
import matplotlib.pyplot as plt

# Example data
x = np.array([1, 2, 3, 4, 5])
y = np.array([3, 4, 5, 6, 7])

# Create a scatter plot
plt.scatter(x, y)
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Scatter Plot')
plt.show()

As you can see, we deliberately constructed a perfectly linear, positive relationship in the example above (every y value is exactly x + 2). That is why the correlation coefficient is 1 and the scatter plot shows the points falling on a straight, upward-sloping line.

#### When interpreting correlation coefficients, recall these concepts:


Positive Correlation: A positive correlation coefficient indicates that as one variable increases, the other variable tends to increase as well. The closer the value is to +1, the stronger the positive correlation.

Negative Correlation: A negative correlation coefficient indicates that as one variable increases, the other variable tends to decrease. The closer the value is to -1, the stronger the negative correlation.

No Correlation: A correlation coefficient close to zero indicates no linear relationship between the variables. However, it's important to note that there could still be a non-linear relationship or a relationship that is not captured by the correlation coefficient.
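
The last point is easy to demonstrate. In the sketch below, y is completely determined by x (y = x squared), yet the Pearson coefficient comes out at essentially zero because the relationship is symmetric rather than linear. The data is made up for illustration.

```python
import numpy as np

# x is symmetric around zero; y depends on x exactly, but not linearly
x = np.array([-3, -2, -1, 0, 1, 2, 3])
y = x ** 2

r_nonlinear = np.corrcoef(x, y)[0, 1]

# Prints a value at (or extremely close to) zero
print("Pearson r for y = x^2:", round(r_nonlinear, 6))
```

This is why it helps to look at a scatter plot alongside the coefficient instead of relying on the number alone.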

#### Correlation analysis may be affected by missing values and outliers in the data, so be sure to consider the following:

Missing Values: Missing data can lead to biased correlation results. You can handle missing values by imputation techniques (e.g., mean, median, or regression imputation) or by removing observations with missing data, depending on the situation.

Outliers: Outliers can have a significant impact on correlation coefficients. Consider identifying and handling outliers before performing correlation analysis. Techniques like winsorization, trimming, or using robust correlation measures can help mitigate the influence of outliers.
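
As a rough illustration of both issues, the sketch below uses made-up data containing one missing value and one extreme outlier. It drops the incomplete pair, then compares Pearson's r with the outlier, with the outlier trimmed by hand, and on ranks (a simple Spearman-style robust measure that is only valid here because the data has no ties). The helper `ranks` is a hypothetical convenience function, not a library call.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([2.1, 3.9, np.nan, 8.2, 9.8, 12.1, 60.0])  # one NaN, one outlier (60.0)

# Missing values: keep only the pairs where both values are present
mask = ~np.isnan(x) & ~np.isnan(y)
x_obs, y_obs = x[mask], y[mask]

# Pearson r with the outlier still in the data
r_with_outlier = np.corrcoef(x_obs, y_obs)[0, 1]

# Trimming: drop the last point, which we flagged as an outlier by eye
r_trimmed = np.corrcoef(x_obs[:-1], y_obs[:-1])[0, 1]

def ranks(values):
    """Return the rank (0 = smallest) of each value; assumes no ties."""
    return values.argsort().argsort()

# Rank-based correlation: Pearson r computed on ranks instead of raw values
r_rank = np.corrcoef(ranks(x_obs), ranks(y_obs))[0, 1]

print("With outlier:   ", round(r_with_outlier, 3))
print("Outlier trimmed:", round(r_trimmed, 3))
print("Rank-based:     ", round(r_rank, 3))
```

Whether to trim, winsorize, or switch to a rank-based measure depends on whether the outlier is a data error or a real observation.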

# What is Machine Learning?

For the upcoming lessons, we will be diving into machine learning. Machine learning is a branch of artificial intelligence (AI) that focuses on developing algorithms and models that enable computers to learn and make predictions or decisions without being explicitly programmed. In simple terms, machine learning is about teaching computers how to learn from data and use that knowledge to perform tasks or make predictions.

The three fundamental steps of machine learning are: first, get your data; second, create a model and train it on that data; and third, use the trained model to make predictions.

# Intro to Scikit-learn

Scikit-learn, also known as sklearn, is a widely-used open-source machine learning library for Python. It provides a range of efficient tools and algorithms for various machine learning tasks, including classification, regression, clustering, dimensionality reduction, model selection, and preprocessing of data.

### Loading the Data

First, we need to load our dataset. Scikit-learn ships with several pre-loaded datasets. One of them is the Iris dataset, often referred to as the "Hello World" of machine learning (recall that "Hello World" is traditionally the first program you write in a new programming language). It contains the sepal and petal measurements (length and width) of 150 iris flowers from three species.

Let's import it from sklearn's pre-loaded datasets and start with some basic exploratory analysis.

In [None]:
from sklearn.datasets import load_iris

# Load Iris Data
iris = load_iris()

### Understanding the Data

Let's get a better understanding of the data by turning it into a pandas DataFrame and then using the `.head()` method to look at the first few rows.

In [None]:
import pandas as pd

iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df.head()

Then let's check the columns.

In [None]:
iris_df.columns

In the Iris dataset, the target values are encoded as 0, 1, and 2. Here is the mapping:

- 0: setosa
- 1: versicolor
- 2: virginica

Let's add the target as a column so we can see how the species of the flower relates to the other features of the plant.

In [None]:
# Add the target column to your DataFrame
iris_df['target'] = iris.target

Now let's try to find patterns in our data using pairplot.

A pairplot draws a scatter plot for every pair of numeric columns, with each variable's own distribution on the diagonal, so you can scan many relationships at once. Do you see any patterns here? Which combination(s) of features appear to be correlated, or perhaps help us group the plants by species (the colors of the dots)?

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.pairplot(iris_df, hue='target')
plt.show()


As you can see, these features appear to provide helpful groupings. Now, what if we wanted to **predict** the species of a plant whose species we do not know? This is where machine learning can help.

We will classify this data with a model called logistic regression. Before we do any machine learning, we need to prepare our data. We will **train** the model on data where the output is labeled (i.e., the data **has** the plant's species) and then **evaluate** the model on data it has not seen to measure its predictive ability.

Because the training data is **labeled**, this is a "supervised" model. The main difference between supervised and unsupervised learning is the need for labeled training data: supervised machine learning relies on labeled input and output training data, whereas unsupervised learning works with unlabeled or raw data.

In [None]:
from sklearn.model_selection import train_test_split

# All columns except 'target'
X = iris_df.drop('target', axis=1)

# Only 'target' column
y = iris_df['target']

# The function train_test_split splits arrays or matrices into random train and test subsets. 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
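
As a quick sanity check on the split, the sketch below repeats it on the raw arrays and inspects the resulting sizes. It uses `load_iris(return_X_y=True)` and `_demo` variable names only to stay self-contained and avoid overwriting the lesson's variables. The `stratify` argument in the second call is an optional extra that the lesson's own split does not use; it keeps the class proportions equal in both subsets.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)

# Same settings as the lesson: 20% of the rows are held out for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print("Training rows:", X_tr.shape[0])
print("Test rows:", X_te.shape[0])
print("Test rows per species:", np.bincount(y_te, minlength=3))

# Optional variation: stratify so each species is equally represented
_, _, _, y_te_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print("Stratified test rows per species:", np.bincount(y_te_strat, minlength=3))
```

Fixing `random_state` makes the shuffle reproducible, so everyone running the notebook gets the same train and test rows.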


### Train a machine learning model

We will use a simple logistic regression model for this multi-class classification problem. Despite its name, logistic regression is used for classification problems, where the goal is to predict a categorical dependent variable. The independent variables are the features of the plant (petal length, sepal width, etc.) and the dependent variable is the species of the plant (categorical data that we represent numerically as setosa (0), versicolor (1), and virginica (2)).

Future lessons will cover how a machine learning model trains, some of the math behind it, and when to use which model. For now, the following is a brief overview of logistic regression.

Logistic regression models the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities with a logistic function. One of the main ideas behind logistic regression is the sigmoid function, sigmoid(z) = 1 / (1 + e^(-z)), which maps any real number to a value between 0 and 1. When the input z is a large positive number, sigmoid returns a value close to 1; when z is a large negative number, it returns a value close to 0; and at z = 0 it returns exactly 0.5.
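
The behaviour of the sigmoid function described above can be checked with a few lines of NumPy. This is a minimal sketch of the standard formula 1 / (1 + e^(-z)); the function name `sigmoid` and the sample inputs are illustrative, and this is not how scikit-learn implements the model internally.

```python
import numpy as np

def sigmoid(z):
    """Map any real number (or array of numbers) into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Large negative, zero, and large positive inputs
z_values = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
outputs = sigmoid(z_values)

for z, s in zip(z_values, outputs):
    print(f"sigmoid({z:6.1f}) = {s:.5f}")
```

The outputs rise smoothly from near 0 to near 1, which is what lets the model treat them as probabilities.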

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=200, random_state=42)
model.fit(X_train, y_train)

### Predict the target values for the test dataset

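Now that the model has been trained on the **labeled** training data (which includes each flower's species), it can be used to predict the species of the iris flowers in the test dataset. The test set has labels too, but we keep them hidden from the model and use them only to check its predictions.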

In [None]:
predictions = model.predict(X_test)
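
Since logistic regression works by estimating probabilities, it can help to look at them directly. The sketch below retrains an equivalent model under `_demo` names (so it does not overwrite the lesson's variables) and uses scikit-learn's `predict_proba` to show that each predicted label is simply the class with the highest estimated probability.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_model = LogisticRegression(max_iter=200, random_state=42)
demo_model.fit(X_tr, y_tr)

# One row per test flower, one column per species (0, 1, 2)
probabilities = demo_model.predict_proba(X_te)
labels = demo_model.predict(X_te)

print("Probabilities for the first three test flowers:")
print(probabilities[:3].round(3))
print("Predicted labels for the same flowers:", labels[:3])
```

Each row of probabilities sums to 1, and the column with the largest value matches the label returned by `predict`.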

### Evaluate the model

Now, we need to see how well the model performs. One common metric for classification is accuracy, which is the proportion of test instances that were classified correctly.

In [None]:
from sklearn.metrics import accuracy_score

print("Model accuracy is: ", accuracy_score(y_test, predictions))


Accuracy alone does not tell us which kinds of errors the model makes. A confusion matrix breaks the results down by actual and predicted class, and it is the basis for several additional metrics.

A confusion matrix is a table often used to describe the performance of a classification model on a set of data for which the true values are known. It essentially is a summary of prediction results on a classification problem.

For a binary classification problem, the confusion matrix is a 2x2 table:

|                     | Predicted Positive | Predicted Negative |
|---------------------|--------------------|--------------------|
| Actual Positive     | True Positive (TP) | False Negative (FN)|
| Actual Negative     | False Positive (FP)| True Negative (TN) |


- True Positives (TP): Correctly predicted positive values. The actual class is yes and the predicted class is also yes.

- True Negatives (TN): Correctly predicted negative values. The actual class is no and the predicted class is also no.

- False Positives (FP): The actual class is no but the predicted class is yes. Also known as a "Type I error".

- False Negatives (FN): The actual class is yes but the predicted class is no. Also known as a "Type II error".

##### Accuracy
Accuracy is the most intuitive performance measure and it is simply a ratio of correctly predicted observations to the total observations.

Accuracy = (TP+TN) / (TP+FP+FN+TN)

##### Precision
Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. High precision means the model produces few false positives.

Precision = TP / (TP+FP)

##### Recall (Sensitivity)
Recall (also known as Sensitivity, Hit Rate, or True Positive Rate) is the ratio of correctly predicted positive observations to all observations in the actual positive class.

Recall = TP / (TP+FN)

Below, we will plot the confusion matrix for what our model predicted on the test data (after being trained on the training data).

In that plot, the Y axis shows the actual species labels of the data, and the X axis shows the predicted species (from the logistic regression model).

First, here is an example of a binary confusion matrix with some dummy data:

|                     | Predicted Positive | Predicted Negative |
|---------------------|--------------------|--------------------|
| Actual Positive     | 100                | 50                 |
| Actual Negative     | 30                 | 120                |

In this matrix,

- True Positives (TP) = 100 (The model correctly predicted Positive 100 times)
- False Negatives (FN) = 50 (The model incorrectly predicted Negative 50 times when it was actually Positive)
- False Positives (FP) = 30 (The model incorrectly predicted Positive 30 times when it was actually Negative)
- True Negatives (TN) = 120 (The model correctly predicted Negative 120 times)

We can calculate accuracy, recall, and precision with these numbers:

- Accuracy: It is simply a ratio of correctly predicted observations to the total observations.
  - Accuracy = (TP+TN) / (TP+FP+FN+TN) = (100+120) / (100+30+50+120) = 220 / 300 = 0.73 or 73%
- Recall (Sensitivity): It is the ratio of correctly predicted positive observations to all observations in the actual positive class.
  - Recall = TP / (TP+FN) = 100 / (100+50) = 100 / 150 = 0.67 or 67%
- Precision: It is the ratio of correctly predicted positive observations to the total predicted positive observations.
  - Precision = TP / (TP+FP) = 100 / (100+30) = 100 / 130 = 0.77 or 77%
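
The arithmetic above can be double-checked with a few lines of plain Python. The counts are the dummy values from the example table; the variable names are illustrative.

```python
# Counts from the dummy confusion matrix
tp, fn = 100, 50   # actual positive row
fp, tn = 30, 120   # actual negative row

accuracy = (tp + tn) / (tp + fp + fn + tn)
recall = tp / (tp + fn)
precision = tp / (tp + fp)

print(f"Accuracy:  {accuracy:.2f}")
print(f"Recall:    {recall:.2f}")
print(f"Precision: {precision:.2f}")
```

Note the parentheses: without them, operator precedence would divide only part of each expression and give a different answer.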

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, predictions)

sns.heatmap(cm, annot=True)
plt.title('Confusion Matrix')
plt.ylabel('Actual Labels')
plt.xlabel('Predicted Labels')
plt.show()
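
With three classes there is no single "positive" class, so precision and recall are usually computed once per class. The sketch below shows one way to do that with NumPy on a hypothetical 3x3 matrix (the numbers are invented, not the model's real output). It assumes the same layout as the plot above: rows are actual labels and columns are predicted labels.

```python
import numpy as np

# Hypothetical confusion matrix: rows = actual class, columns = predicted class
cm_demo = np.array([
    [10, 0, 0],   # actual setosa
    [0, 8, 1],    # actual versicolor
    [0, 2, 9],    # actual virginica
])

correct = np.diag(cm_demo)  # correctly classified count for each class

# Recall: of all flowers that truly belong to a class, how many did we find?
recall_per_class = correct / cm_demo.sum(axis=1)

# Precision: of all flowers predicted as a class, how many were right?
precision_per_class = correct / cm_demo.sum(axis=0)

# Accuracy: correct predictions over all predictions
accuracy_demo = correct.sum() / cm_demo.sum()

print("Recall per class:   ", recall_per_class.round(2))
print("Precision per class:", precision_per_class.round(2))
print("Accuracy:           ", round(accuracy_demo, 2))
```

Row sums give the number of actual members of each class, and column sums give the number of predictions for each class, which is why they serve as the denominators for recall and precision respectively.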

<span style = "background-color: yellow">
TODO: With your neighbor, decide who will take precision and who will take recall. Each of you should calculate your metric for your confusion matrix, then walk each other through your math.
</span>