# Assignment Overview

Links to the notes discussed in the video
* [Model Selection Overview](./ModelSelect.pdf)
* [Model Types](./ModelType.pdf)
* [Model Decision Factors](./ModelDecisionFactors.pdf)
* [Generalization Techniques](./Generalization.pdf)

The assignment consists of two parts requiring you to select appropriate models with associated code/text.

1. Determine the challenge and a relevant model for two distinct situations (fill out this notebook).
1. Write the data-handling code and select the model for [car factors](./CarFactors/carfactors.ipynb), contained in the `CarFactors` subdirectory.

* ***Check the rubric in Canvas*** to make sure you understand the requirements and the associated grading weights.

# Part 1: Speed Dating Model Selection

Explore the speed dating data set and construct two models that provide some insight, such as grouping or predictions.  The models must come from different model areas, such as the categories listed in the [ModelTypes](./ModelTypes.pdf) document.  You must justify your answer considering the data and the prediction value.

The data is contained in [SpeedDatingData.csv](SpeedDatingData.csv).  The values are detailed in [SpeedDatingKey.md](./SpeedDatingKey.md).  The directory also contains the original key document, SpeedDatingDataKey.docx, but JupyterLab is unable to render it.  You are free to open it outside of JupyterLab if something did not translate clearly.  The open source tool [pandoc](https://pandoc.org/installing.html) was used to perform the translation.  It is useful for almost any document conversion and works on all major operating systems.

# Model 1

## Outline the challenge 

My overall challenge, based on the given data, is to identify the target audience for a dating events company. The idea behind this company is to slow things down, because speed dating is not for everyone. Its goal is to give people more time to assess and analyze partners. It will achieve this by allowing people to give a prospective date a second chance, even if that person was not their first choice initially.

Hence, the first challenge is to create a model that can identify people who would be interested in slow, activity-based dating. To do this, I intend to apply clustering. A clustering model will help create groups of people based on their willingness to attend dating-related activities. Three clusters could probably be created: for sure, maybe, and absolutely not. Once people are clustered, the models from challenge 2 can be applied.

### Select the features and their justification 

The first feature I would like to focus on is whether a person gives or receives ratings of 3-7. To me, such people are unsure. If someone gives or receives a rating below 3 or above 7, they are fairly sure about their decision.
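One way to make the "unsure" group concrete is a boolean flag on the rating columns. The sketch below uses a small made-up table; the column names `rating_given` and `rating_received` are placeholders, not names taken from SpeedDatingKey.md, and treating the 3-7 range as inclusive is an assumption.

```python
import pandas as pd

# Toy data; column names are hypothetical stand-ins for the survey's rating fields
ratings = pd.DataFrame({
    "rating_given":    [2, 5, 9, 7, 3],
    "rating_received": [4, 6, 8, 1, 7],
})

# A rating from 3 to 7 (inclusive, by assumption) is treated as "unsure"
gave_unsure = ratings["rating_given"].between(3, 7)
got_unsure = ratings["rating_received"].between(3, 7)

# Keep a row when either the rating given or the rating received was unsure
ratings["unsure"] = gave_unsure | got_unsure
unsure_people = ratings[ratings["unsure"]]
print(unsure_people)
```

Whether "gives or receives" should be combined with `|` (either) or `&` (both) is a modeling choice that would need to be settled against the real data.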

Then, I will use features that do not focus on "immediate acceptance/rejection" qualities such as looks, race, or religion.


### Note necessary feature processing such as getting rid of empty cells etc.

* Handle missing values by imputation (the kind of imputation depends on the nature of the data).
* Handle categorical variables with label or one-hot encoding.
* Handle numerical variables with standard scaling, normalization, binning, etc.
* Apply feature transformations such as log or polynomial features.
* Select features using importance scores, statistics, and techniques such as recursive feature elimination or random forest and gradient boosting classifiers.
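The imputation, encoding, and scaling steps can be combined into a single scikit-learn preprocessing object. This is a minimal sketch on made-up data: the column names and the choice of median and most-frequent imputation are assumptions for illustration, not decisions taken from the data set.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with gaps; column names are illustrative only
toy = pd.DataFrame({
    "age": [24.0, np.nan, 31.0, 27.0],
    "go_out": [1.0, 3.0, np.nan, 2.0],
    "field": ["law", "math", np.nan, "law"],
})
numeric_cols = ["age", "go_out"]
categorical_cols = ["field"]

# Numeric columns: fill gaps with the median, then standardize
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical columns: fill gaps with the most frequent value, then one-hot encode
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# sparse_threshold=0 forces a dense array so the result is easy to inspect
preprocess = ColumnTransformer(
    [("num", numeric_steps, numeric_cols), ("cat", categorical_steps, categorical_cols)],
    sparse_threshold=0,
)
features = preprocess.fit_transform(toy)
print(features.shape)
```

Keeping the steps in one object means the same transformations can later be fitted on training data and reused unchanged on new data.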

### Model Selection

The fundamental nature of my first model is clustering, because I would like to find the pool of prospective clients. To do so, I need to break the market down into various clusters so that I can tune my marketing and optimize activities.
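Since the number of clusters is not fixed in advance, one common heuristic is to compare silhouette scores across candidate values. The sketch below runs on synthetic blobs rather than the survey data, so the cluster centers and the range of `k` tried are assumptions made purely for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the standardized survey features: three separated groups
X_demo, _ = make_blobs(
    n_samples=300, centers=[[0, 0], [5, 5], [-5, 5]], cluster_std=0.6, random_state=0
)

# Fit KMeans for several cluster counts and record the silhouette score of each
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

# The cluster count with the highest silhouette score is the candidate to inspect first
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

A high silhouette score only says the groups are well separated; whether they correspond to "for sure", "maybe", and "absolutely not" would still need to be checked by inspecting each cluster.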


In [None]:
# ******************* Clustering models for challenge -1 ****************

# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.preprocessing import StandardScaler

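# Load the speed dating data (the original cell never defined `df`)
# (pass encoding='ISO-8859-1' if the default UTF-8 decoding fails)
df = pd.read_csv('SpeedDatingData.csv')

# Select relevant features
selected_features = ['go_out', 'sports', 'exphappy', 'ratings']  # This list will change

# Create a subset dataframe with the selected features, dropping rows with
# missing values because the clustering models cannot handle NaN
subset_df = df[selected_features].dropna()

# Standardize the data (zero mean, unit variance per feature)
scaler = StandardScaler()
normalized_data = scaler.fit_transform(subset_df)

# Create a new dataframe with the standardized data
normalized_df = pd.DataFrame(normalized_data, columns=selected_features)

# 'normalized_df' now holds the standardized features; X is the array passed to the clustering models
X = normalized_df.to_numpy()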

# Apply KMeans clustering
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
kmeans_labels = kmeans.labels_

# Apply Agglomerative Clustering
agg_clustering = AgglomerativeClustering(n_clusters=4)
agg_labels = agg_clustering.fit_predict(X)

# Apply DBSCAN
dbscan = DBSCAN(eps=0.3, min_samples=10)
dbscan_labels = dbscan.fit_predict(X)

# Visualize the clustering results
plt.figure(figsize=(12, 4))

plt.subplot(131)
plt.scatter(X[:, 0], X[:, 1], c=kmeans_labels, cmap='viridis')
plt.title('KMeans Clustering')

plt.subplot(132)
plt.scatter(X[:, 0], X[:, 1], c=agg_labels, cmap='viridis')
plt.title('Agglomerative Clustering')

plt.subplot(133)
plt.scatter(X[:, 0], X[:, 1], c=dbscan_labels, cmap='viridis')
plt.title('DBSCAN Clustering')

plt.show()

# Model 2

## Outline the challenge

My overall challenge, based on the given data, is to identify the target audience for a dating events company. The idea behind this company is to slow things down, because speed dating is not for everyone. Its goal is to give people more time to assess and analyze partners. It will achieve this by allowing people to give a prospective date a second chance, even if that person was not their first choice initially.

Based on this description, my second challenge is to classify people for activities or activities for people. I intend to apply classification to each cluster that comes out of model 1. A classification model will assign people to various kinds of activities, or activities to various kinds of people: for example, people suited to outdoor activities, or activities suited to introverted people.

Finally, a prediction model can be used to predict whether people will attend the activities.
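The cluster-then-classify idea can be sketched as one classifier per cluster. Everything below is synthetic: the feature columns, the attendance label, the choice of two clusters, and the use of logistic regression are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins: two "interest" columns used for clustering and one
# extra column that drives the made-up attendance label
interests = rng.normal(size=(200, 2))
extra = rng.normal(size=200)
attended = (extra + rng.normal(scale=0.5, size=200) > 0).astype(int)
features = np.column_stack([interests, extra])

# Step 1: group people (the role of model 1)
cluster_ids = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(interests)

# Step 2: fit one classifier per cluster (the role of model 2)
models = {}
for cluster in np.unique(cluster_ids):
    mask = cluster_ids == cluster
    if len(np.unique(attended[mask])) < 2:
        continue  # a classifier needs both outcomes to learn from
    models[int(cluster)] = LogisticRegression().fit(features[mask], attended[mask])

# Step 3: predict attendance (0 or 1) for the first person in each cluster
for cluster, model in models.items():
    person = features[cluster_ids == cluster][:1]
    print(cluster, model.predict(person)[0])
```

Fitting a separate model per cluster lets each group have its own decision rule, at the cost of less training data per model.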

### Select the features and their justification

For this model, I would like to focus on activity-related features. The idea is to learn which activities are more valuable and to whom. The goal is to see and predict which activities, for which people (by age, education, etc.), result in more matches, so that I can use, discard, or customize activities.

Secondly, I would also like to focus on features related to going out, because people who like to go out might come to events.
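A quick way to check whether an activity-related feature is tied to matches is to compare match rates across interest bands. This sketch uses invented numbers; the `sports` and `match` column names and the band boundaries are assumptions and should be checked against SpeedDatingKey.md.

```python
import pandas as pd

# Toy data; 'sports' is an interest score and 'match' a 0/1 outcome (both illustrative)
toy = pd.DataFrame({
    "sports": [2, 3, 8, 9, 5, 6, 1, 10],
    "match":  [0, 0, 1, 1, 0, 1, 0, 1],
})

# Bucket the interest score into low (1-3), medium (4-7), and high (8-10) bands
toy["sports_band"] = pd.cut(
    toy["sports"], bins=[0, 3, 7, 10], labels=["low", "medium", "high"]
)

# Share of dates that ended in a match, per band
match_rate = toy.groupby("sports_band", observed=True)["match"].mean()
print(match_rate)
```

The same grouping could be repeated per age or education group to see for whom an activity matters most.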

### Model Selection

Outline the rationale for selecting the model noting how its capabilities address your challenge

This model will mainly involve classification and, finally, prediction: classification because I would like to suggest different activities to different people, and prediction because I would like to know whom to focus on more.
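Knowing whom to focus on can be framed as ranking people by predicted probability rather than by a hard yes/no label. The sketch below is a hedged illustration on synthetic data; the features, the label, and the choice of logistic regression are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Synthetic stand-ins for activity-interest features and a 0/1 attendance label
X_demo = rng.normal(size=(100, 3))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

model = LogisticRegression().fit(X_demo, y_demo)

# Probability of the positive class (attendance) for each person
attend_prob = model.predict_proba(X_demo)[:, 1]

# Indices of the five people the model considers most likely to attend
top_five = np.argsort(attend_prob)[::-1][:5]
print(top_five, attend_prob[top_five])
```

Here the ranking is computed on the training rows for brevity; in practice it would be applied to people the model has not seen.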

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Select relevant features and target
selected_features = ['movies', 'sports', 'museum']  # This list will change; here for demonstration
target_column = 'target'  # This will be decided based on the survey data

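# Load the data if `df` is not already in memory (assumes the CSV sits next to this notebook)
df = pd.read_csv('SpeedDatingData.csv')

# Drop rows with missing values in the chosen columns, since several of
# these classifiers (e.g. LogisticRegression, SVC) reject NaN
model_df = df[selected_features + [target_column]].dropna()
X = model_df[selected_features]
y = model_df[target_column]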

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Normalize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Initialize classifiers
classifiers = {
    'Logistic Regression': LogisticRegression(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'SVM': SVC(),
    'KNN': KNeighborsClassifier()
}

# Train and evaluate each classifier
for clf_name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)
    print(f"Classifier: {clf_name}")
    print(f"Accuracy: {accuracy:.2f}")
    print(f"Classification Report:\n{report}\n")