# Assignment Overview

Links to the notes discussed in the video
* [Model Selection Overview](./ModelSelect.pdf)
* [Model Types](./ModelType.pdf)
* [Model Decision Factors](./ModelDecisionFactors.pdf)
* [Generalization Techniques](./Generalization.pdf)

The assignment consists of two parts requiring you to select appropriate models with associated code/text.

1. Determine challenge and relevant model for two distinct situations (fill out this notebook). 
1. Address the data code needed and the model for [car factors](./CarFactors/carfactors.ipynb) contained in the subdirectory, CarFactors.

* ***Check the rubric in Canvas*** to make sure you understand the requirements and the associated grading weights.

# Part 1: Speed Dating Model Selection

You are to explore the data set on speed dating and construct two models that provide some insight, such as groupings or predictions.  The models must come from different model areas, such as the categories listed in the [ModelTypes](./ModelTypes.pdf) document.  You must justify your answer considering the data and the prediction value.

The data is contained in [SpeedDatingData.csv](SpeedDatingData.csv).  The values are detailed in [SpeedDatingKey.md](./SpeedDatingKey.md).  The directory also contains the original key document, SpeedDatingDataKey.docx, but Jupyter Lab is unable to render it.  You are free to render it outside of Jupyter Lab if something didn't translate clearly.  The open source tool [pandoc](https://pandoc.org/installing.html) was used to perform the translation.  It is useful for almost any document conversion and works on all major operating systems.

# Model 1

## Outline the challenge 

The challenge is to predict whether two people will be a match. The feature to predict is "match", whose value is 0 or 1 (a numerical representation of no match or match). A prediction is successful when it equals the true value. In speed dating, the predictions that matter most are true positives, so false positives are costly: we do not want to predict that two people will be a match when they will not. Likewise, we want to minimize false negatives, where a match should occur but is missed.
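As a minimal sketch of how these error types could be counted, the snippet below uses scikit-learn's metrics on a handful of hypothetical labels. The `y_true` and `y_pred` values are made up for illustration and are not taken from the speed dating data.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for eight pairings: 1 = match, 0 = no match
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Precision falls as false positives rise (predicted matches that did not happen)
precision = precision_score(y_true, y_pred)
# Recall falls as false negatives rise (real matches the model missed)
recall = recall_score(y_true, y_pred)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision tracks the false positive concern and recall tracks the false negative concern, so reporting both is one way to check that neither error type is being ignored.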

### Select the features and their justification 

The features I will use for this model are gender, order, int_corr, samerace, age_o, age, race_o, race, pf_o_att, dec_o, attr_o, field_cd, mn_sat, imprace, from, goal, date, go_out, career_c, attr1_1, sinc1_1, intel1_1, fun1_1, amb1_1, attr3_1, sinc3_1, int3_1, fun3_1, amb3_1, attr5_1, sinc5_1, int5_1, fun5_1, amb5_1. The match column is kept alongside them as the target. I chose these features because I think they will have the most impact on whether there is a match or not.

We are assuming that this speed dating event is only for straight participants. If this were not true, gender might not be a good indicator and could instead cause the model to overfit to the straight population, since most of the match datapoints would come from straight pairings.

I also chose to use order as a feature because the order in which participants meet may affect their perception and their chances of a match. A participant may forget someone met earlier in the night and perhaps be more inclined to match with someone met later on.

I decided to use int_corr as a measure of the participant and partner's shared interest and ignore the actual values for their interest/hobby features as they are redundant. This feature quantifies the correlation between the interest features. 
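To illustrate what a single correlation value can summarize, here is a small sketch that computes a Pearson correlation between two hypothetical sets of interest ratings. The ratings are invented, and treating int_corr as a Pearson correlation is an assumption made for this example; the key document is the authority on how the column was actually produced.

```python
import numpy as np

# Hypothetical 1-10 ratings of the same five activities by each person
participant = np.array([8, 3, 9, 5, 2])
partner = np.array([7, 4, 10, 6, 1])

# Pearson correlation between the two rating vectors, always within [-1, 1]
int_corr = np.corrcoef(participant, partner)[0, 1]
print(round(int_corr, 2))
```

A value near 1 means the two people rank the activities similarly, which is why one number can stand in for the full set of individual interest columns.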

I also decided to use the samerace and imprace features. These two features can be combined to create a "race_probability" feature. I decided not to use the imprelig feature. It may seem helpful at first, but since there is no "religion" feature for participants, there is no data that can be used with "imprelig" to compare the religions of two candidates.

I chose to use field_cd, mn_sat, from, goal, career_c and go_out as features as well because, from my own knowledge, these are features that may impact the outcome of a match.

Finally, I chose to use three sets of the attribute features: how the participant rates him or herself on each attribute, what the participant looks for in a date, and how the participant thinks others perceive him or her. Together these show how the participant sees him or herself and whether that matches what the potential candidate is looking for.

### Note necessary feature processing such as getting rid of empty cells etc.

- Create the race_probability feature: first convert samerace to -1 or 1 for no and yes. Then multiply the imprace feature by this value to get race_probability. This feature ranges from -10 to 10 and can then be rescaled to between -1 and 1.
- One-hot encode categorical data values and rescale to standardize the numerical values.
- Fill in any missing numerical values with the average score. I am choosing the average because we do not want a missing attribute feature to be treated as 1 or 10 (too low or too high). If a value is missing for these features, it should indicate an average manifestation of that attribute.
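The processing steps above could be sketched as follows. The rows are hypothetical and only a few of the selected columns are shown; dividing by 10 assumes imprace is on a 1-10 scale, as the range of -10 to 10 implies.

```python
import pandas as pd

# Hypothetical rows using column names from the feature list
df = pd.DataFrame({
    "samerace": [1, 0, 1, 0],
    "imprace": [8.0, 10.0, None, 2.0],
    "attr3_1": [7.0, None, 5.0, 6.0],
    "from": ["Chicago", "Boston", "Chicago", None],
})

# Fill missing numerical values with the column mean
num_cols = ["imprace", "attr3_1"]
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

# Map samerace from {0, 1} to {-1, 1}, multiply by imprace, rescale to [-1, 1]
df["race_probability"] = (2 * df["samerace"] - 1) * df["imprace"] / 10

# One-hot encode the categorical column; a missing value becomes an all-zero row
df = pd.get_dummies(df, columns=["from"], dtype=float)
print(df)
```

Filling the missing values first keeps race_probability from inheriting a gap when imprace is missing.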

### Model Selection

I will use a random forest classifier here. I chose it because there are a number of different features and a number of data points available to train on. There may be hidden relationships between features that this model will be able to use. A neural network or SVM may also have been good choices. A neural network is also able to learn complex correlations between features. However, I felt there are not enough datapoints to train a neural network for this case. An SVM would also make a good classifier, since it works well when the data is high dimensional and has non-linear decision boundaries. However, I thought the given data points were sufficient that we do not need an SVM (the data is not very high dimensional compared to the number of datapoints).
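One way to check a choice like this is to compare candidates with cross-validation. The sketch below does so on synthetic data from `make_classification`, not on the speed dating set, so the scores it prints say nothing about the real problem; the sample size, feature counts, and hyperparameters are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data, NOT the speed dating set
X, y = make_classification(n_samples=300, n_features=20, n_informative=6,
                           random_state=0)

candidates = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    # SVMs are sensitive to feature scale, so standardize inside the pipeline
    "svm_rbf": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

# Mean 5-fold cross-validated accuracy for each candidate
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
for name, score in scores.items():
    print(f"{name}: mean accuracy {score:.3f}")
```

Running the same loop on the processed speed dating features would show whether the random forest actually holds up against the alternatives considered here.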

In [15]:
# Enter python code of constructing your selected model  - CODE REQUIRED! (only the model creation)

# Assuming you are using Python with pandas and scikit-learn
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# thousands=',' guards against numeric columns stored like "1,070.00" (an assumption about the file)
df = pd.read_csv('SpeedDatingData.csv', thousands=',')

#process df and create new features - sample processing
df = df[['gender', 'order', 'match', 'int_corr', 'samerace', 'age_o', 'age', 'race_o', 'race', 'pf_o_att', 'dec_o', 'attr_o', 'field_cd', 'mn_sat', 'imprace', 'from', 'goal', 'date', 'go_out', 'career_c', 'attr1_1', 'sinc1_1', 'intel1_1', 'fun1_1', 'amb1_1', 'attr3_1',
    'sinc3_1', 'fun3_1', 'amb3_1', 'attr5_1', 'sinc5_1', 'fun5_1', 'amb5_1']]
# Fill missing numerical values with the column mean, as described in the processing notes
df = df.fillna(df.mean(numeric_only=True))
# One-hot encode the text column 'from'
df = pd.get_dummies(df, columns=['from'], drop_first=True)
# Separate the features from the target
X = df.drop(columns='match')
y = df['match']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Train a Random Forest Classifier
clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)

# Model 2

## Outline the challenge

The challenge I chose is to find patterns among people with similar demographics. I am curious how much demographics affect people's matching tendencies. The goal of this challenge is to use demographic data to find clusters in which matches are more likely.

### Select the features and their justification

The features I chose for this challenge are based on demographics: age, from, race, undergrad, mn_sat, zipcode, income, and career_c. I chose these features because they cover a range of demographics: race, location of origin, income, education, and current location of residence. Further, I decided to include mn_sat because there have been popular studies indicating that SAT scores can be correlated with a child's family income, so there is a demographic correlation here. Finally, I chose to include career_c because I know certain careers tend to favor certain demographics of people, so this feature will provide some insight for our challenge.

### Model Selection

I chose to use a K-means clustering model here. This model groups datapoints by their distance from each other in the feature space (scikit-learn's KMeans uses Euclidean distance), so participants with similar demographic features end up in the same cluster. Once the clusters have formed, new datapoints can be assigned to the nearest cluster and predictions can be made from the cluster they fall in.
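Because K-means relies on distance, features on very different scales can dominate the result. The sketch below, using three hypothetical participants described only by age and income, shows how standardizing changes which pair looks closest; the values are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical participants: [age, income]
people = np.array([
    [22.0, 40000.0],
    [35.0, 41000.0],
    [23.0, 90000.0],
])

def euclidean(a, b):
    """Straight-line distance between two points in feature space."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# On raw values, income differences swamp age differences
raw_01 = euclidean(people[0], people[1])  # ages differ by 13 years
raw_02 = euclidean(people[0], people[2])  # incomes differ by 50,000

# After standardizing, each feature contributes on a comparable scale
scaled = StandardScaler().fit_transform(people)
scaled_01 = euclidean(scaled[0], scaled[1])
scaled_02 = euclidean(scaled[0], scaled[2])

print(f"raw:    {raw_01:.1f} vs {raw_02:.1f}")
print(f"scaled: {scaled_01:.2f} vs {scaled_02:.2f}")
```

On the raw values the second distance is far larger purely because income is measured in the tens of thousands, which suggests standardizing mn_sat, income, and age before clustering.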

In [19]:
# Enter python code of constructing your selected model  - CODE REQUIRED! (only the model creation)
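import pandas as pd
from sklearn.cluster import KMeans

# thousands=',' guards against numeric columns stored like "1,070.00" (an assumption about the file)
df = pd.read_csv('SpeedDatingData.csv', thousands=',')

#process df and create new features - sample processing
y = df['match']  # not used to fit K-means; kept to compare match rates across clusters
X = df[['age', 'from', 'race', 'undergrad', 'mn_sat', 'zipcode', 'income', 'career_c']]
X = X.fillna(X.mean(numeric_only=True))  # fill numeric gaps with the column mean
X = pd.get_dummies(X, dtype=float)  # one-hot encode any remaining text columns

# K-means is unsupervised, so no train/test split is needed to create the model
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
kmeans.fit(X)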
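
To connect the clusters back to the challenge, the match rate within each cluster can be compared after fitting. This is a minimal sketch on hypothetical rows with only two demographic features; the `toy` data and the choice of two clusters are assumptions for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical demographic rows plus the observed match outcome
toy = pd.DataFrame({
    "age":    [22, 23, 24, 35, 36, 37],
    "income": [40000, 42000, 41000, 90000, 88000, 91000],
    "match":  [1, 1, 0, 0, 0, 1],
})

# Standardize so age and income contribute comparably to the distance
features = StandardScaler().fit_transform(toy[["age", "income"]])

# Cluster on demographics only; match is never shown to the model
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
toy["cluster"] = kmeans.fit_predict(features)

# Share of matches inside each cluster
match_rate = toy.groupby("cluster")["match"].mean()
print(match_rate)
```

A large gap between the per-cluster match rates would suggest that the demographic grouping carries information about matching, which is the question this model is meant to explore.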