# Midterm Projects

## A. Feature Coercion and Feature Generation

> We can only feed numbers into our linear regression model, yet often our data is in other forms.  Our goal here is to convert as much of our data as possible into numbers so that we can feed it into our machine learning model.

In class, we saw how to coerce the following types of data.

* Text to datetimes (e.g. time since, months ago)
    * Once the data is in datetime format, use `add_datepart` to convert it to numbers and generate additional features
* "Almost" numbers to numbers (`$40.00`, `25%` to `40`, `25`)
* Categorical data
    * Coercing text like cities and neighborhoods into categorical
    * Combining categories into an "other" category when their observation counts fall below a threshold
* Text to booleans (`"T"`, `"F"` to `True`, `False`)
    * Identifying data that is *almost* boolean (because it is dominated by one category) and converting it to boolean
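
The coercions listed above can be sketched with plain pandas. This is a minimal illustration, not the code from class: the helper names (`to_number`, `to_boolean`, `lump_rare`), the column names, and the threshold are all assumptions made for the example.

```python
import pandas as pd

def to_number(series):
    """Strip $, % and thousands separators, then parse what is left as a number."""
    cleaned = series.astype(str).str.replace(r"[$,%]", "", regex=True).str.strip()
    return pd.to_numeric(cleaned, errors="coerce")  # unparseable values become NaN

def to_boolean(series, true_values=("T", "t", "True", "true")):
    """Map text flags to booleans; anything not in true_values becomes False."""
    return series.isin(true_values)

def lump_rare(series, threshold=2, other="Other"):
    """Replace categories seen fewer than `threshold` times with `other`."""
    counts = series.value_counts()
    rare = counts[counts < threshold].index
    return series.where(~series.isin(rare), other).astype("category")

# Hypothetical listings data, used only to exercise the helpers.
df = pd.DataFrame({
    "price": ["$40.00", "$1,250.00", "$15.50", "$99.00"],
    "discount": ["25%", "0%", "10%", "5%"],
    "instant_bookable": ["T", "F", "T", "T"],
    "city": ["Brooklyn", "Brooklyn", "Queens", "Yonkers"],
})
df["price"] = to_number(df["price"])
df["discount"] = to_number(df["discount"])
df["instant_bookable"] = to_boolean(df["instant_bookable"])
df["city"] = lump_rare(df["city"], threshold=2)
print(df)
```

Note that `to_boolean` as written also turns missing values into `False`, which may or may not be what you want for your data.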

> From our data we can generate additional features.

1. DateTime
    * `add_datepart` to identify day of week, month, year, etc.
    * Can add time since or until relevant dates (e.g. Christmas, for retailers)
    * Time since or until internal dates (e.g. time from user registration to listing)
2. Geographic
    * Distance from a location
    * Zip codes: convert to categories, and remove the last digit to limit the number of distinct zip codes
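
As a rough sketch of the feature generation ideas above: the code below does not call `add_datepart`; `add_date_features` is a pandas-only stand-in that produces a few similar columns. All function names, column names, and the reference coordinates are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

def add_date_features(df, col):
    """Pandas-only stand-in for a few of the date parts (not add_datepart itself)."""
    dates = pd.to_datetime(df[col])
    out = df.copy()
    out[col + "_year"] = dates.dt.year
    out[col + "_month"] = dates.dt.month
    out[col + "_dayofweek"] = dates.dt.dayofweek  # Monday = 0
    return out

def days_until(dates, month, day):
    """Days from each date to the next occurrence of month/day (0 on the day itself)."""
    dates = pd.to_datetime(dates).dt.normalize()
    this_year = pd.to_datetime(pd.DataFrame({"year": dates.dt.year, "month": month, "day": day}))
    next_year = pd.to_datetime(pd.DataFrame({"year": dates.dt.year + 1, "month": month, "day": day}))
    target = this_year.where(this_year >= dates, next_year)
    return (target - dates).dt.days

def haversine_km(lat, lon, ref_lat, ref_lon):
    """Great-circle distance in kilometres from each (lat, lon) to a reference point."""
    lat, lon, ref_lat, ref_lon = map(np.radians, (lat, lon, ref_lat, ref_lon))
    a = (np.sin((ref_lat - lat) / 2) ** 2
         + np.cos(lat) * np.cos(ref_lat) * np.sin((ref_lon - lon) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

# Hypothetical listings data; zip codes are kept as text so leading zeros survive.
listings = pd.DataFrame({
    "listed_on": ["2019-12-20", "2019-12-26", "2019-07-04"],
    "user_registered_on": ["2019-12-01", "2019-01-01", "2019-07-04"],
    "latitude": [40.7580, 40.6782, 40.7580],
    "longitude": [-73.9855, -73.9442, -73.9855],
    "zip": ["10036", "11217", "10038"],
})
listed = pd.to_datetime(listings["listed_on"])
registered = pd.to_datetime(listings["user_registered_on"])

features = add_date_features(listings, "listed_on")
features["days_to_christmas"] = days_until(listed, month=12, day=25)
features["days_since_registration"] = (listed - registered).dt.days
features["km_to_reference"] = haversine_km(
    listings["latitude"], listings["longitude"], 40.7580, -73.9855
)
# Dropping the last digit merges neighbouring zip codes into one category.
features["zip_prefix"] = listings["zip"].str[:-1].astype("category")
print(features)
```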

> **Feature Coercion and Generation Requirements**

1. Basic Requirements
    * Convert text to booleans where applicable
    * Convert all categorical data to multiple features via `get_dummies`

2. Data Munging Requirements
For at least three columns, do the following:
    * Convert columns that begin as type "object" into categories
    * Convert columns that begin as type "object" into datetimes
    * Convert geographic data into numbers (via distance, or zip codes, or categories)
    * Convert "almost" numbers to numbers 
3. Clean Code Requirements
    * We want our code to be reusable, both within this project and in future data science projects
    * To that end, **write at least three methods** that automate some of the coercion and feature generation
        * Think about functions that take a dataframe, a column, or multiple columns as arguments and return a new column, a subset of columns, or a coerced column
        * Take a function that we wrote in class and change or expand upon it
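
To make the clean code requirement concrete, here is one possible shape for such reusable functions: each takes a dataframe plus one or more columns and returns a new dataframe. The names `coerce_columns`, `dummify`, and `dollars_to_number` are hypothetical and are not the functions written in class.

```python
import pandas as pd

def dollars_to_number(series):
    """Coerce a single text column such as "$120" into a numeric column."""
    return pd.to_numeric(series.str.replace("$", "", regex=False))

def coerce_columns(df, coercions):
    """Apply a coercion function to each named column and return a new dataframe."""
    out = df.copy()  # leave the caller's dataframe untouched
    for column, func in coercions.items():
        out[column] = func(out[column])
    return out

def dummify(df, columns, drop_first=False):
    """Return a copy of df with each listed column replaced by indicator columns."""
    return pd.get_dummies(df, columns=columns, drop_first=drop_first, dtype=int)

# Hypothetical raw data, used only to show the functions working together.
raw = pd.DataFrame({
    "room_type": ["Entire home", "Private room", "Entire home"],
    "price": ["$120", "$60", "$200"],
})
clean = coerce_columns(raw, {"price": dollars_to_number})
encoded = dummify(clean, ["room_type"])
print(encoded)
```

Keeping each function free of project-specific column names is what makes it reusable: the column names arrive as arguments, so the same code can run against a different dataset.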

> The data munging requirements are the most difficult to specify.  Essentially, I am looking to see that you are able to extract data that is not perfectly formatted.  If your data is already pre-formatted, then add an additional dataset via an API, or tie together two datasets.  That will count.  If you have questions, please email me with a link to your dataset and the data munging work you plan on doing.

## B. Feature Selection and Model Building

> After coercing and generating our features, we can begin selecting features.  Selecting features allows us to reduce error due to variance, reduce multicollinearity, and increase interpretability of our model.

1. Recursive Feature Elimination with Cross Validation
    * Quickly create multiple models with recursive feature elimination
    * Discover how few features our model can be limited to before a significant drop in $R^2$ score
2. Recursive Feature Elimination
    * Once we have discovered the number of features to be selected, we use recursive feature elimination to discover which of those features we should select
    * Use `rfe.support_` to identify those features
3. (Optional) Plot the target variable 
    * Can we improve our scores by either eliminating outliers or taking a log of our target variable?

4. Correlation Analysis
    * In the July 30 class, we will discuss correlation analysis, and see how it can further allow us to prune and combine features.
    1. Use "rank" scatter plots to identify highly correlated features
    2. Use Spearman correlations to identify highly correlated features
    3. Use a dendrogram to identify highly correlated features
    

## C. Final Model

* Once you have selected 