# Data Science - Case Study

Classification is a supervised machine learning task that involves assigning predefined labels or categories to input data based on its features. The goal is to build a model that learns the patterns in the data and can accurately predict the label or category of new, unseen data. It's commonly used for tasks like spam email detection, disease diagnosis, sentiment analysis, image recognition, and much more.

Let's go through a simple case study to understand classification better:

Case Study: Email Spam Detection

Problem Statement: You work for an email service provider and want to develop an automated system that can classify incoming emails as either spam or not spam (ham).

Data: You have a dataset of emails, each labeled as spam or ham, along with the text content of the emails.

Steps Involved:

1. Data Collection and Preprocessing: Gather a labeled dataset of emails, where each email is labeled as spam or ham. Preprocess the text data by removing special characters, converting to lowercase, and tokenizing the text into words.

2. Feature Extraction: Convert the text data into numerical features that machine learning algorithms can work with. This could involve techniques like TF-IDF (Term Frequency-Inverse Document Frequency) or word embeddings.

3. Model Selection: Choose a classification algorithm. Common choices include Logistic Regression, Naive Bayes, Support Vector Machines, Decision Trees, Random Forests, and Neural Networks.

4. Model Training: Split the dataset into training and testing sets. The training set is used to train the model, and the testing set is used to evaluate its performance. Train the chosen classification model using the training data.

5. Model Evaluation: Evaluate the model's performance using metrics such as accuracy, precision, recall, F1-score, and confusion matrix on the testing data. Adjust model parameters and features to improve performance if necessary.

6. Model Deployment and Prediction: Once satisfied with the model's performance, deploy it to make predictions on new, unseen emails. When a new email comes in, preprocess its text, extract features, and use the trained model to predict whether it's spam or ham.

7. Ongoing Monitoring and Maintenance: Continuously monitor the model's performance on real-world data. Re-evaluate and retrain the model periodically to account for changing patterns in spam emails.

This case study illustrates how classification is applied to a real-world problem. The main steps involve data preprocessing, feature extraction, model selection, training, evaluation, deployment, and ongoing maintenance.
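
The steps above can be sketched end to end with scikit-learn. This is only a minimal illustration, not a production spam filter: the eight emails are hypothetical toy data, and pairing `TfidfVectorizer` with `MultinomialNB` is one assumed choice among the algorithms listed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy emails; a real system would use thousands of labeled messages
emails = [
    "Claim your free prize now",
    "Free prize waiting, click here",
    "You won a free vacation prize",
    "Free money, claim your prize today",
    "Meeting agenda for Monday attached",
    "Please review the meeting agenda",
    "Team meeting agenda and notes",
    "Updated agenda for the project meeting",
]
labels = ["spam"] * 4 + ["ham"] * 4

# Hold out one email per class for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    emails, labels, test_size=0.25, stratify=labels, random_state=42
)

# TfidfVectorizer lowercases the text and tokenizes on word characters by
# default, which drops punctuation; MultinomialNB then learns from the weights
spam_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
spam_model.fit(X_train, y_train)

# Precision, recall, and F1-score per class on the held-out emails
print(classification_report(y_test, spam_model.predict(X_test), zero_division=0))

# Classify new, unseen emails with the same preprocessing and features
new_emails = ["free prize", "meeting agenda"]
print(dict(zip(new_emails, spam_model.predict(new_emails))))
```

Wrapping the vectorizer and classifier in one pipeline means new emails go through exactly the same preprocessing as the training data, which is what the deployment step requires.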

Remember that different classification algorithms might perform differently on different datasets, and the choice of algorithm depends on the nature of the data and the specific problem you're trying to solve.
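
One way to compare algorithms on a given dataset is cross-validation. The sketch below assumes scikit-learn's bundled Iris dataset as stand-in data and three arbitrarily chosen classifiers; the ranking it prints applies to this dataset only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# Mean 5-fold cross-validated accuracy for each candidate
mean_scores = {
    name: cross_val_score(clf, X, y, cv=5).mean()
    for name, clf in candidates.items()
}

for name, score in mean_scores.items():
    print(f"{name}: {score:.3f}")
```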

In [1]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Sample dataset: weight in grams and color (0 for red, 1 for orange)
data = np.array([[100, 0], [130, 0], [135, 1], [150, 1]])
labels = np.array([0, 0, 1, 1])  # 0: Apple, 1: Orange

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.2, random_state=42)

# Create a logistic regression model
model = LogisticRegression()

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the testing data
predictions = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, predictions)

# Print the accuracy
print(f"Accuracy: {accuracy * 100:.2f}%")

Accuracy: 0.00%


In [2]:
print('data :\n',data)
print('X_train : \n',X_train)
print('y_train :\n',y_train)
print('X_test : \n',X_test)
print('y_test : \n',y_test)

data :
 [[100   0]
 [130   0]
 [135   1]
 [150   1]]
X_train : 
 [[150   1]
 [100   0]
 [135   1]]
y_train :
 [1 0 1]
X_test : 
 [[130   0]]
y_test : 
 [0]


In [3]:
train_test_split?

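In this example:
1. We have a simple dataset with weight and color as features and corresponding labels (0 for Apple, 1 for Orange).
2. We split the dataset into training and testing sets using train_test_split. With only four samples, test_size=0.2 leaves a single sample in the testing set.
3. We create a LogisticRegression model and train it using the training data.
4. We make predictions on the testing data and calculate the accuracy using accuracy_score. Because the testing set holds one sample, the accuracy can only be 0% or 100%. Here the single test sample (an apple weighing 130 g) was misclassified, so the reported 0.00% says little about the model.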
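
The four-sample split above leaves one sample for testing, so a single prediction decides the score. A possible variation is sketched below under stated assumptions: the twelve fruit measurements are hypothetical, and the `StandardScaler` step and the `stratify` argument are additions that the original cell does not use.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical measurements: weight in grams and color (0 for red, 1 for orange)
data = np.array([
    [100, 0], [110, 0], [120, 0], [125, 0], [130, 0], [140, 1],
    [128, 1], [135, 1], [145, 1], [150, 1], [155, 1], [160, 0],
])
labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])  # 0: Apple, 1: Orange

# test_size=4 holds out four samples; stratify keeps two of each class among them
X_train, X_test, y_train, y_test = train_test_split(
    data, labels, test_size=4, stratify=labels, random_state=42
)

# Scale the features so the regularized model sees weight and color on a comparable scale
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)

print(f"Test samples per class: {np.bincount(y_test)}")
print(f"Accuracy: {accuracy * 100:.2f}%")
```

Even with four test samples the accuracy moves in steps of 25%, so the number is still a rough indication rather than a reliable estimate.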

In [4]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data  # Features
y = iris.target  # Labels

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Random Forest classifier
model = RandomForestClassifier()

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the testing data
predictions = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, predictions)

# Print the accuracy
print(f"Accuracy: {accuracy * 100:.2f}%")

Accuracy: 100.00%


In [5]:
X

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       [4.9, 3

In this example:
1. We're using the Iris dataset, which has 150 samples, four numeric features, and three classes.
2. We're using a RandomForestClassifier, an ensemble of decision trees that can capture nonlinear relationships in the data.
3. We split the dataset into training and testing sets using train_test_split.
4. We train the model on the training data and predict the labels for the testing data.
5. We calculate the accuracy using accuracy_score.
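
A single accuracy figure hides which classes are confused with each other. The sketch below repeats the same workflow and adds a confusion matrix, a per-class report, and feature importances. Passing `random_state=42` to the classifier is an assumption added here so that the forest is reproducible; the original cell leaves it unset.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Fixing random_state makes the forest itself reproducible, not just the split
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, predictions)
print(cm)

# Precision, recall, and F1-score for each species
print(classification_report(y_test, predictions, target_names=iris.target_names))

# Relative contribution of each feature to the forest's splits (sums to 1)
for name, importance in zip(iris.feature_names, model.feature_importances_):
    print(f"{name}: {importance:.3f}")
```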

import numpy as np: This imports the NumPy library and aliases it as np.

from sklearn.datasets import load_iris: This imports the load_iris function from the scikit-learn library's datasets module. The load_iris function provides the famous Iris dataset for classification.

from sklearn.model_selection import train_test_split: This imports the train_test_split function, which is used to split the dataset into training and testing sets.

from sklearn.ensemble import RandomForestClassifier: This imports the RandomForestClassifier class, which is an ensemble learning method used for classification.

from sklearn.metrics import accuracy_score: This imports the accuracy_score function, which calculates the accuracy of a classification model.

iris = load_iris(): This loads the Iris dataset.
X = iris.data: This assigns the features (input variables) of the Iris dataset to the variable X.
y = iris.target: This assigns the target labels (output variable) of the Iris dataset to the variable y.

train_test_split(X, y, test_size=0.2, random_state=42): This function splits the data into training and testing sets. X_train and y_train will contain the training data and labels, while X_test and y_test will contain the testing data and labels. test_size=0.2 specifies that 20% of the data will be used for testing, and random_state=42 sets a seed for reproducibility.

model = RandomForestClassifier(): This creates an instance of the RandomForestClassifier class, which will be used as our classification model.

model.fit(X_train, y_train): This trains the model using the training data (X_train and y_train).

predictions = model.predict(X_test): This uses the trained model to make predictions on the testing data (X_test).

accuracy = accuracy_score(y_test, predictions): This calculates the accuracy of the model's predictions by comparing the predicted labels (predictions) with the actual labels (y_test).

print(f"Accuracy: {accuracy * 100:.2f}%"): This prints the accuracy as a percentage with two decimal places.
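
The effect of `test_size` and `random_state` described in the walkthrough can be checked directly. This is a small sketch; the variable names ending in `_again` and `_other` are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First split with a fixed seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # 120 training rows, 30 testing rows

# Repeating the call with the same seed yields the same partition
X_train_again, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.2, random_state=42
)
same_split = np.array_equal(X_test, X_test_again)

# A different seed shuffles the rows differently
_, X_test_other, _, _ = train_test_split(X, y, test_size=0.2, random_state=0)
different_split = not np.array_equal(X_test, X_test_other)

print(same_split, different_split)
```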

In [6]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Sample dataset: mileage in miles, age in years, brand (0 for Toyota, 1 for Honda)
data = np.array([[50000, 3, 0], [80000, 5, 1], [20000, 1, 0], [60000, 4, 1]])
prices = np.array([15000, 12000, 18000, 13000])  # Prices in dollars

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data, prices, test_size=0.2, random_state=42)

# Create a Linear Regression model
model = LinearRegression()

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the testing data
predictions = model.predict(X_test)

# Calculate Mean Squared Error and R-squared
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

# Print the results
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared: {r2:.2f}")


Mean Squared Error: 810000.00
R-squared: nan




In this example:
1. We have a simple dataset with mileage, age, and brand as features, and the corresponding prices of used cars.
2. We split the dataset into training and testing sets using train_test_split.
3. We create a LinearRegression model and train it using the training data.
4. We make predictions on the testing data and calculate the Mean Squared Error (MSE) and R-squared using mean_squared_error and r2_score. With four samples and test_size=0.2, the testing set contains a single car. R-squared is not defined for fewer than two samples, which is why r2_score returns nan.
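
R-squared needs at least two test samples to be defined. The sketch below uses a larger synthetic dataset so that both metrics are meaningful. The forty cars and the pricing rule that generates them are hypothetical, invented purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical synthetic cars: the pricing rule below is invented for illustration
rng = np.random.default_rng(42)
n_cars = 40
mileage = rng.uniform(10_000, 100_000, n_cars)        # miles
age = mileage / 15_000 + rng.normal(0, 0.5, n_cars)   # years
brand = rng.integers(0, 2, n_cars)                    # 0 for Toyota, 1 for Honda
prices = (20_000 - 0.08 * mileage - 300 * age - 500 * brand
          + rng.normal(0, 300, n_cars))               # dollars

data = np.column_stack([mileage, age, brand])

# 20% of 40 cars gives 8 test samples, enough for R-squared to be defined
X_train, X_test, y_train, y_test = train_test_split(
    data, prices, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

mse = mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)  # same units as the prices (dollars)
r2 = r2_score(y_test, predictions)

print(f"Test samples: {len(y_test)}")
print(f"Mean Squared Error: {mse:.2f}")
print(f"Root Mean Squared Error: {rmse:.2f}")
print(f"R-squared: {r2:.2f}")
```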

In [8]:
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# load_boston was removed in scikit-learn 1.2 (the maintainers discourage the
# Boston dataset because of an ethical problem with one of its variables), so
# this cell uses the California housing dataset, one of the alternatives that
# the scikit-learn ImportError message suggests.
housing = fetch_california_housing()  # Downloads the data on first use
X = housing.data  # Features
y = housing.target  # Median house values (in units of $100,000)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Linear Regression model
model = LinearRegression()

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the testing data
predictions = model.predict(X_test)

# Calculate Mean Squared Error and R-squared
mse = mean_squared_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

# Print the results
print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared: {r2:.2f}")


Linear Regression is a fundamental supervised learning algorithm used for predicting a continuous target variable based on one or more input features. It assumes a linear relationship between the input features and the target variable.

For simple linear regression, which uses a single input feature, the model can be represented as:
y = mx + b

Where:

y is the predicted target variable (output).
x is the input feature.
m is the slope (weight) of the line.
b is the y-intercept.

The goal of linear regression is to find the values of m and b that minimize the sum of squared differences between the predicted values and the actual target values. This is the Ordinary Least Squares (OLS) method.
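
For a single feature, the least-squares values of m and b have a closed form. The sketch below computes them with NumPy on five hypothetical data points and cross-checks the result against `np.polyfit`.

```python
import numpy as np

# Hypothetical data that roughly follows y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

# Closed-form OLS estimates for one feature:
#   m = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
#   b = mean(y) - m * mean(x)
x_mean, y_mean = x.mean(), y.mean()
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b = y_mean - m * x_mean

# Sum of squared residuals, the quantity that OLS minimizes
sse = np.sum((y - (m * x + b)) ** 2)
print(f"m = {m:.3f}, b = {b:.3f}, SSE = {sse:.3f}")

# Cross-check against NumPy's least-squares polynomial fit of degree 1
m_check, b_check = np.polyfit(x, y, 1)
print(np.allclose([m, b], [m_check, b_check]))
```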

With multiple input features (multiple linear regression), the model equation becomes:
y = b0 + b1*x1 + b2*x2 + ... + bn*xn

Where:

y is the predicted target variable.
x1, x2, ..., xn are the input features.
b0, b1, ..., bn are the coefficients (weights) associated with the features.
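
The coefficients of the multi-feature equation can be estimated with a general least-squares solver. The sketch below assumes noise-free hypothetical data generated from y = 1 + 2*x1 - 3*x2, so the recovered coefficients can be compared with known values, and shows where scikit-learn stores the same numbers.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data generated from y = 1 + 2*x1 - 3*x2
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = 1 + 2 * X[:, 0] - 3 * X[:, 1]

# Prepend a column of ones so b0 is estimated alongside b1 and b2
design = np.column_stack([np.ones(len(X)), X])
coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)
print("b0, b1, b2 =", np.round(coefficients, 3))

# scikit-learn stores the same values as intercept_ (b0) and coef_ (b1..bn)
model = LinearRegression().fit(X, y)
print("intercept_ =", round(model.intercept_, 3), "coef_ =", np.round(model.coef_, 3))
```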

Here are the steps involved in implementing Linear Regression:

1. Data Preparation: Collect and preprocess the dataset. This includes handling missing values, encoding categorical variables, and splitting the data into training and testing sets.

2. Model Creation: Choose the type of linear regression (simple or multiple) and create the linear regression model object using a suitable library (e.g., scikit-learn in Python).

3. Model Training: Fit the model to the training data, which involves estimating the coefficients that minimize the difference between predicted and actual values.

4. Prediction: Use the trained model to make predictions on new, unseen data.

5. Evaluation: Evaluate the model's performance using appropriate metrics such as Mean Squared Error (MSE), R-squared (coefficient of determination), and others.

6. Interpretation: Interpret the coefficients to understand the relationship between the features and the target variable. A positive coefficient means the predicted target increases as that feature increases while the other features are held fixed. A negative coefficient means it decreases.

7. Generalization: Check that the model performs about as well on held-out data as on the training data before relying on its predictions for new data.
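
The steps above can be walked through in one short script. This sketch assumes scikit-learn's bundled diabetes dataset as example data, chosen here only because it ships with the library and needs no download or cleaning.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Step 1. Data preparation: the bundled data is numeric with no missing values
diabetes = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, test_size=0.2, random_state=42
)

# Steps 2 and 3. Model creation and training
model = LinearRegression().fit(X_train, y_train)

# Steps 4 and 5. Prediction and evaluation on the held-out data
test_predictions = model.predict(X_test)
test_mse = mean_squared_error(y_test, test_predictions)
test_r2 = r2_score(y_test, test_predictions)

# Step 6. Interpretation: one coefficient per feature
for name, coef in zip(diabetes.feature_names, model.coef_):
    print(f"{name}: {coef:.1f}")

# Step 7. Generalization: compare training and testing R-squared
train_r2 = model.score(X_train, y_train)
print(f"Train R-squared: {train_r2:.2f}")
print(f"Test R-squared: {test_r2:.2f}, Test MSE: {test_mse:.1f}")
```

A testing R-squared far below the training value would suggest the model does not generalize well.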

Assumptions of Linear Regression:

1. Linearity: The relationship between the input features and the target variable should be approximately linear.
2. Independence: The errors (residuals) should be independent of each other.
3. Homoscedasticity: The variance of the errors should be roughly constant across all values of the input features (equivalently, across the range of predicted values).
4. Normality: The errors should be normally distributed.
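
These assumptions are usually checked on the residuals after fitting. The sketch below shows a few simple numeric checks, assuming hypothetical data generated to satisfy the assumptions. The specific diagnostics (Durbin-Watson, a correlation of absolute residuals with fitted values, Shapiro-Wilk) are one possible selection, and residual plots are a common complement.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

# Hypothetical data that satisfies the assumptions by construction
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 5.0 + rng.normal(0, 1.0, size=200)

model = LinearRegression().fit(X, y)
fitted = model.predict(X)
residuals = y - fitted

# Sanity check: training residuals of a model with an intercept average to zero
print(f"Mean residual: {residuals.mean():.2e}")

# Independence: the Durbin-Watson statistic lies in [0, 4]; values near 2
# suggest little first-order autocorrelation (most relevant for ordered data)
durbin_watson = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)
print(f"Durbin-Watson: {durbin_watson:.2f}")

# Homoscedasticity: |residuals| should not grow or shrink with the fitted values
spread_corr = np.corrcoef(fitted, np.abs(residuals))[0, 1]
print(f"Correlation of |residual| with fitted value: {spread_corr:.2f}")

# Normality: Shapiro-Wilk test; a small p-value is evidence against normal errors
shapiro_stat, shapiro_p = stats.shapiro(residuals)
print(f"Shapiro-Wilk p-value: {shapiro_p:.3f}")
```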

Remember that while Linear Regression is a powerful and widely used algorithm, it might not be suitable for all types of data and relationships. In some cases, more complex algorithms like Polynomial Regression, Decision Trees, or Neural Networks might be needed to capture nonlinear relationships in the data.
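
As an illustration of the nonlinear case, the sketch below fits a straight line and a degree-2 polynomial to hypothetical data with a quadratic relationship. `PolynomialFeatures` combined with `LinearRegression` is one common way to do polynomial regression in scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data with a quadratic relationship plus a little noise
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = x[:, 0] ** 2 + rng.normal(0, 0.1, size=50)

# A straight line cannot follow the curve
linear_model = LinearRegression().fit(x, y)
r2_linear = linear_model.score(x, y)

# Adding x^2 as a feature keeps the model linear in its coefficients
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(x, y)
r2_poly = poly_model.score(x, y)

print(f"Linear R-squared: {r2_linear:.3f}")
print(f"Polynomial R-squared: {r2_poly:.3f}")
```

Both scores are computed on the training data here, so they show how well each model can fit the curve rather than how well it generalizes.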