# Classification and Model Selection

## Classifying Kickstarter Campaigns

Kickstarter is a crowdfunding platform with a community of more than 10 million people, made up of creative and tech enthusiasts who help bring new projects to life.

To date, members have contributed more than \$3 billion to fund creative projects.
The projects can be almost anything: a device, a game, an app, a film, etc.

Kickstarter works on an all-or-nothing basis: a campaign is launched with a target amount, and if it doesn't meet its goal, the project owner gets nothing. For example, if a project's goal is $\$5000$ and it receives $\$4999$ in funding, the project won't be a success.

Suppose you have a project that you would like to post on Kickstarter: can you predict whether it will be successfully funded? Looking at the dataset, what useful information can you extract, which variables are informative for your prediction, and can you interpret the model?

The goal of this project is to build a classifier to predict whether a project will be successfully funded or not. 

**💡 You can use any algorithm of your choice.**

We will use `sklearn` and the usual data science libraries such as `pandas` and `numpy`.

KATE expects your code to define variables with specific names that correspond to certain things we are interested in.

KATE will run your notebook from top to bottom and check the latest value of those variables, so make sure you don't overwrite them.

* Remember to uncomment the line assigning the variable to your answer and don't change the variable or function names.
* Use copies of the original or previous DataFrames to make sure you do not overwrite them by mistake.

You will find instructions below about how to define each variable.

Once you're happy with your code, upload your notebook to KATE to check your feedback.

### Baseline Model

In this exercise, we aim to outperform a simple `baseline` model: a logistic regression with only two features, `goal_usd` (the goal converted to US dollars) and `usa` (whether the campaign was based in the US).

The code to build this `baseline` is shown below:

```Python
from sklearn.linear_model import LogisticRegression

# Conduct some custom processing on your training data
df["usa"] = df["country"] == "US"
df["goal_usd"] = df["goal"] * df["static_usd_rate"]

df = df[["goal_usd", "usa", "state"]]

# Conduct the same processing on your testing data
df_eval["usa"] = df_eval["country"] == "US"
df_eval["goal_usd"] = df_eval["goal"] * df_eval["static_usd_rate"]

df_eval = df_eval[["goal_usd", "usa", "state"]]

X = df.drop(["state"], axis=1)
y = df["state"]

X_eval = df_eval.drop(["state"], axis=1)

model = LogisticRegression()
model.fit(X, y)

y_pred = model.predict(X_eval)
```
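Because `df_eval` carries no labels, one way to gauge the baseline locally is to hold out part of the labelled data. The sketch below illustrates this on a small synthetic frame: the values, the toy labelling rule, and the 80/20 split are assumptions made for the example, and only the column names `goal_usd`, `usa`, and `state` come from the baseline above.

```Python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the processed training data (illustrative values only)
rng = np.random.default_rng(0)
n = 1000
demo = pd.DataFrame({
    "goal_usd": rng.lognormal(mean=8.0, sigma=1.5, size=n),
    "usa": rng.random(n) < 0.7,
})
# Toy label (an assumption for this sketch): smaller goals succeed more often
demo["state"] = (rng.random(n) < 1 / (1 + demo["goal_usd"] / 5000)).astype(float)

X_demo = demo.drop(columns="state")
y_demo = demo["state"]

# Hold out 20% of the labelled rows to estimate performance locally
X_train, X_val, y_train, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

baseline = LogisticRegression(max_iter=1000)
baseline.fit(X_train, y_train)

val_accuracy = accuracy_score(y_val, baseline.predict(X_val))
print(f"validation accuracy: {val_accuracy:.3f}")
```

The same split can be reused for every candidate model, so that each one is compared against the baseline on identical validation rows.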

### Our Model

To kick things off, let's import and use our favourite data processing library, `pandas`, to retrieve the data that we will use to build a machine learning model.

In this assignment, we are going to load in two datasets. The first, `df`, is going to contain all the data we will need to train and test a model. This will include the labels indicating whether or not the project was successfully funded. The second dataset, `df_eval`, is going to contain all the data that our model will be evaluated on by **KATE**. It does not include the labels indicating project success, so can be viewed as held-out test data. 

We will need to process `df_eval` in exactly the same way as `df`, then use our model trained on `df` to make predictions about `df_eval`. On submission, **KATE** will evaluate these predictions against their labels (which **KATE** has access to).


Run the cell below to load the raw data. Note that `pandas` can read these gzip-compressed files directly into regular `DataFrames`:

In [1]:
import pandas as pd

df = pd.read_csv("data/kickstarter.gz")
df_eval = pd.read_csv("data/kickstarter_eval.gz")

print(df.shape)
print(df_eval.shape)


(50000, 26)
(10000, 26)


We have also displayed the dimensions of `df` and `df_eval`. Notice that `df_eval` has only $10,000$ rows. As mentioned earlier, this will be our test set for submissions to **KATE**.

**This means that we cannot train on `df_eval`**


The aim of this practical is to:
  * Process `df` into an input dataframe `X` and a label dataframe `y`
  * Process `df_eval` into an input dataframe `X_eval` (in the same way as we processed `df` into `X`)
  * Train a classification model of our choice on `X` and `y`
  * Submit our code to KATE, where our model will be evaluated on `X_eval` and `y_eval`

Let's kick things off by checking out our `df`. Note that the `state` column contains our success labels:

In [2]:
df.head()

Unnamed: 0,id,photo,name,blurb,goal,slug,disable_communication,country,currency,currency_symbol,...,location,category,profile,urls,source_url,friends,is_starred,is_backing,permissions,state
0,805910621,"{""small"":""https://ksr-ugc.imgix.net/assets/012...",DOCUMENTARY FILM titled FROM RAGS TO SPIRITUAL...,A MOVIE ABOUT THE WILLINGNESS TO BREAK FREE FR...,125000.0,movie-made-from-book-titled-from-rags-to-spiri...,False,US,USD,$,...,"{""country"":""US"",""urls"":{""web"":{""discover"":""htt...","{""urls"":{""web"":{""discover"":""http://www.kicksta...","{""background_image_opacity"":0.8,""should_show_f...","{""web"":{""project"":""https://www.kickstarter.com...",https://www.kickstarter.com/discover/categorie...,,,,,0.0
1,1279627995,"{""small"":""https://ksr-ugc.imgix.net/assets/011...","American Politics, Policy, Power and Profit",Everything you should know about really big go...,9800.0,american-politics-policy-power-and-profit,False,US,USD,$,...,"{""country"":""US"",""urls"":{""web"":{""discover"":""htt...","{""urls"":{""web"":{""discover"":""http://www.kicksta...","{""background_image_opacity"":0.8,""should_show_f...","{""web"":{""project"":""https://www.kickstarter.com...",https://www.kickstarter.com/discover/categorie...,,,,,0.0
2,1306016155,"{""small"":""https://ksr-ugc.imgix.net/assets/013...","Drew Jacobs Official ""Kiss Me"" Music Video","Be a part of the new ""Kiss Me"" Official Music ...",2500.0,drew-jacobs-official-kiss-me-music-video,False,US,USD,$,...,"{""country"":""US"",""urls"":{""web"":{""discover"":""htt...","{""urls"":{""web"":{""discover"":""http://www.kicksta...","{""background_image_opacity"":0.8,""should_show_f...","{""web"":{""project"":""https://www.kickstarter.com...",https://www.kickstarter.com/discover/categorie...,,,,,1.0
3,658851276,"{""small"":""https://ksr-ugc.imgix.net/assets/011...",Still Loved,When their dreams are shattered by the loss of...,10000.0,still-loved,False,GB,GBP,£,...,"{""country"":""GB"",""urls"":{""web"":{""discover"":""htt...","{""urls"":{""web"":{""discover"":""http://www.kicksta...","{""background_image_opacity"":0.8,""should_show_f...","{""web"":{""project"":""https://www.kickstarter.com...",https://www.kickstarter.com/discover/categorie...,,,,,1.0
4,1971770539,"{""small"":""https://ksr-ugc.imgix.net/assets/012...",Nine Blackmon's HATER Film Project,HATER is a mock rock doc about why the Rucker ...,5500.0,nine-blackmons-hater-film-project,False,US,USD,$,...,"{""country"":""US"",""urls"":{""web"":{""discover"":""htt...","{""urls"":{""web"":{""discover"":""http://www.kicksta...","{""background_image_opacity"":0.8,""should_show_f...","{""web"":{""project"":""https://www.kickstarter.com...",https://www.kickstarter.com/discover/categorie...,,,,,0.0


In addition to using the `.head()` function, let's also retrieve some more information about our data:

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 26 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   id                      50000 non-null  int64  
 1   photo                   50000 non-null  object 
 2   name                    49998 non-null  object 
 3   blurb                   49998 non-null  object 
 4   goal                    50000 non-null  float64
 5   slug                    50000 non-null  object 
 6   disable_communication   50000 non-null  bool   
 7   country                 50000 non-null  object 
 8   currency                50000 non-null  object 
 9   currency_symbol         50000 non-null  object 
 10  currency_trailing_code  50000 non-null  bool   
 11  deadline                50000 non-null  int64  
 12  created_at              50000 non-null  int64  
 13  launched_at             50000 non-null  int64  
 14  static_usd_rate         50000 non-null

From the above, we can see that there are $50,000$ projects in `df` and, apart from a handful of columns, most of our data is not null - what a relief!

In [4]:
df_eval.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 26 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   id                      10000 non-null  int64  
 1   photo                   10000 non-null  object 
 2   name                    10000 non-null  object 
 3   blurb                   10000 non-null  object 
 4   goal                    10000 non-null  float64
 5   slug                    10000 non-null  object 
 6   disable_communication   10000 non-null  bool   
 7   country                 10000 non-null  object 
 8   currency                10000 non-null  object 
 9   currency_symbol         10000 non-null  object 
 10  currency_trailing_code  10000 non-null  bool   
 11  deadline                10000 non-null  int64  
 12  created_at              10000 non-null  int64  
 13  launched_at             10000 non-null  int64  
 14  static_usd_rate         10000 non-null 

Unlike `df`, `df_eval` contains only $10,000$ projects. Also note that the target variable, `state`, is null for all entries; as mentioned earlier, the true labels are stored on **KATE** for evaluating our model when we submit our code.

In [5]:
print("df labels:      ", df.state.unique())
print("df_eval labels: ", df_eval.state.unique())

df labels:       [0. 1.]
df_eval labels:  [nan]


In [6]:
df.permissions.unique()

array([nan, '[]'], dtype=object)

**Notes on the dataset**:
* The target `state` corresponds to a binary outcome: `0` for failed, `1` for successful. 
* The variables `'deadline'`, `'created_at'`, `'launched_at'` are stored in Unix time format.
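As a small illustration of working with these Unix timestamps, the sketch below converts two sample values (seconds since 1970-01-01 UTC) into `pandas` datetimes and derives a month and a duration in days. The column names `deadline_month` and `funding_days` are hypothetical names chosen for this example.

```Python
import pandas as pd

# Sample timestamps in seconds since the Unix epoch (illustrative values)
demo = pd.DataFrame({
    "launched_at": [1444673815],
    "deadline": [1447162860],
})

# unit="s" tells pandas the integers are seconds, not nanoseconds
launched = pd.to_datetime(demo["launched_at"], unit="s")
deadline = pd.to_datetime(demo["deadline"], unit="s")

# Calendar features become available through the .dt accessor
demo["deadline_month"] = deadline.dt.month

# Subtracting datetimes gives a Timedelta, converted here to fractional days
demo["funding_days"] = (deadline - launched).dt.total_seconds() / 86400

print(demo)
```

Once converted, the `.dt` accessor also exposes attributes such as `dayofweek` and `year`, which may be worth trying as features.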

## Part 1: Preprocessing the data

Although our data is relatively clean, it is not yet in a state where we can train a model. For instance, both `df` and `df_eval` contain the columns used for training (features) as well as the target column, although the target is, of course, null for `df_eval`.

What we have to do now is preprocess our data. Specifically, we need to:
 - Build a training set: `X` and `y`
 - Build an evaluation set: `X_eval`
 
 <br>
 
Let's start by extracting the `state` column from `df` into a variable called `y`:
 - Create a new variable `y` from `df["state"]`
 - Drop the `state` column from `df` and `df_eval`

In [7]:
y = df["state"]

df.drop("state", axis=1, inplace=True)
df_eval.drop("state", axis=1, inplace=True)

### 1.1 Remove redundant columns

After removing the target variable `y` from our input data, we can start processing.

Remove all the columns that **you** think are not salient for this classification task. For instance, `id`, `photo`, `slug`, and `disable_communication` are some features which are not likely to be relevant. The choice of which features to retain, however, is yours to make. Remember to remove the same columns from both `df` and `df_eval`.

You can use the `.drop()` method from `pandas` to remove columns (remember to specify `axis=1`).

In [8]:
df.head()

Unnamed: 0,id,photo,name,blurb,goal,slug,disable_communication,country,currency,currency_symbol,...,creator,location,category,profile,urls,source_url,friends,is_starred,is_backing,permissions
0,805910621,"{""small"":""https://ksr-ugc.imgix.net/assets/012...",DOCUMENTARY FILM titled FROM RAGS TO SPIRITUAL...,A MOVIE ABOUT THE WILLINGNESS TO BREAK FREE FR...,125000.0,movie-made-from-book-titled-from-rags-to-spiri...,False,US,USD,$,...,"{""urls"":{""web"":{""user"":""https://www.kickstarte...","{""country"":""US"",""urls"":{""web"":{""discover"":""htt...","{""urls"":{""web"":{""discover"":""http://www.kicksta...","{""background_image_opacity"":0.8,""should_show_f...","{""web"":{""project"":""https://www.kickstarter.com...",https://www.kickstarter.com/discover/categorie...,,,,
1,1279627995,"{""small"":""https://ksr-ugc.imgix.net/assets/011...","American Politics, Policy, Power and Profit",Everything you should know about really big go...,9800.0,american-politics-policy-power-and-profit,False,US,USD,$,...,"{""urls"":{""web"":{""user"":""https://www.kickstarte...","{""country"":""US"",""urls"":{""web"":{""discover"":""htt...","{""urls"":{""web"":{""discover"":""http://www.kicksta...","{""background_image_opacity"":0.8,""should_show_f...","{""web"":{""project"":""https://www.kickstarter.com...",https://www.kickstarter.com/discover/categorie...,,,,
2,1306016155,"{""small"":""https://ksr-ugc.imgix.net/assets/013...","Drew Jacobs Official ""Kiss Me"" Music Video","Be a part of the new ""Kiss Me"" Official Music ...",2500.0,drew-jacobs-official-kiss-me-music-video,False,US,USD,$,...,"{""urls"":{""web"":{""user"":""https://www.kickstarte...","{""country"":""US"",""urls"":{""web"":{""discover"":""htt...","{""urls"":{""web"":{""discover"":""http://www.kicksta...","{""background_image_opacity"":0.8,""should_show_f...","{""web"":{""project"":""https://www.kickstarter.com...",https://www.kickstarter.com/discover/categorie...,,,,
3,658851276,"{""small"":""https://ksr-ugc.imgix.net/assets/011...",Still Loved,When their dreams are shattered by the loss of...,10000.0,still-loved,False,GB,GBP,£,...,"{""urls"":{""web"":{""user"":""https://www.kickstarte...","{""country"":""GB"",""urls"":{""web"":{""discover"":""htt...","{""urls"":{""web"":{""discover"":""http://www.kicksta...","{""background_image_opacity"":0.8,""should_show_f...","{""web"":{""project"":""https://www.kickstarter.com...",https://www.kickstarter.com/discover/categorie...,,,,
4,1971770539,"{""small"":""https://ksr-ugc.imgix.net/assets/012...",Nine Blackmon's HATER Film Project,HATER is a mock rock doc about why the Rucker ...,5500.0,nine-blackmons-hater-film-project,False,US,USD,$,...,"{""urls"":{""web"":{""user"":""https://www.kickstarte...","{""country"":""US"",""urls"":{""web"":{""discover"":""htt...","{""urls"":{""web"":{""discover"":""http://www.kicksta...","{""background_image_opacity"":0.8,""should_show_f...","{""web"":{""project"":""https://www.kickstarter.com...",https://www.kickstarter.com/discover/categorie...,,,,


- 'id', 'photo', 'slug', 'disable_communication' are not relevant
- 'disable_communication' is False for all training values
- 'friends', 'is_starred', 'is_backing', 'permissions' carry no information: they are null (or, for 'permissions', an empty list)
- 'currency', 'currency_symbol', 'currency_trailing_code' follow directly from 'country'
- 'creator', 'location', 'profile', 'urls', 'source_url' are probably not relevant and would need heavy parsing

Notes:
- 'creator' could be useful; we'll leave it out for now and may reintroduce it later
- 'location' vs 'country': is there a benefit to using one over the other?

In [9]:
# Your code here:
columns_to_drop = [
    'id', 'photo', 'slug', 'disable_communication', 
    'currency', 'currency_symbol', 'currency_trailing_code', 
    'creator', 'location', 'profile', 'urls', 'source_url', 
    'friends', 'is_starred','is_backing', 'permissions'
]

df.drop(columns_to_drop, axis=1, inplace=True)
df_eval.drop(columns_to_drop, axis=1, inplace=True)

### 1.2 Fill null values

Looking at the output of `df.info()` above, we can see that some of the columns which we might be interested in as features contain some null values. Null values are, in general, a problem for machine learning models and can cause your code to break. How you choose to deal with them, however, will depend in large part on how you intend to process your data. For instance, if your input data consists of strings that you wish to generate a word count feature from, you can just fill in the null values with empty strings (`""`).

Thankfully, `pandas` has a helpful method for dealing with null values: `.fillna()`. Remember to apply the same treatment to both `df` and `df_eval`:

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   name             49998 non-null  object 
 1   blurb            49998 non-null  object 
 2   goal             50000 non-null  float64
 3   country          50000 non-null  object 
 4   deadline         50000 non-null  int64  
 5   created_at       50000 non-null  int64  
 6   launched_at      50000 non-null  int64  
 7   static_usd_rate  50000 non-null  float64
 8   category         50000 non-null  object 
dtypes: float64(2), int64(3), object(4)
memory usage: 3.4+ MB


In [11]:
df_eval.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   name             10000 non-null  object 
 1   blurb            10000 non-null  object 
 2   goal             10000 non-null  float64
 3   country          10000 non-null  object 
 4   deadline         10000 non-null  int64  
 5   created_at       10000 non-null  int64  
 6   launched_at      10000 non-null  int64  
 7   static_usd_rate  10000 non-null  float64
 8   category         10000 non-null  object 
dtypes: float64(2), int64(3), object(4)
memory usage: 703.2+ KB


In [12]:
# Your code here:
# Assign the result back to the column: the chained form
# df[col].fillna(..., inplace=True) is deprecated and stops working in pandas 3.0
df['name'] = df['name'].fillna('')
df['blurb'] = df['blurb'].fillna('')

df_eval['name'] = df_eval['name'].fillna('')
df_eval['blurb'] = df_eval['blurb'].fillna('')

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   name             50000 non-null  object 
 1   blurb            50000 non-null  object 
 2   goal             50000 non-null  float64
 3   country          50000 non-null  object 
 4   deadline         50000 non-null  int64  
 5   created_at       50000 non-null  int64  
 6   launched_at      50000 non-null  int64  
 7   static_usd_rate  50000 non-null  float64
 8   category         50000 non-null  object 
dtypes: float64(2), int64(3), object(4)
memory usage: 3.4+ MB


### 1.3 Additional Processing

In the previous two exercises we have covered the most basic steps in preprocessing: dropping redundant columns and working with null values. However, there is *so* much more that we can do to extract useful information from our data. 

For instance, the `blurb` column contains unique strings and so, in its current form, isn't a particularly useful feature. Instead, we could create a new feature representing the length of the `blurb`, or the number of words in the `blurb`. 

Other string-type columns, such as `country`, contain categorical data. As there are a lot of countries represented, we might want to aggregate these into regions (e.g. `Europe`, `Asia`, ...). We can then convert this categorical data into a one-hot encoding using `sklearn`.

What we are describing here is what's known as feature engineering and is an art and a science in its own right. 

Let's start this processing by importing some libraries and functions that can help us create features. Notice that we import `StandardScaler` from `sklearn`. We can use this transformer to standardise our numerical data (zero mean and unit variance per feature), which is an important step when training many machine learning models.
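To illustrate how a scaler is typically applied to train and evaluation data, here is a short sketch with made-up goal values. The statistics are learned from the training rows with `fit_transform()` and then reused on the evaluation rows with `transform()`, so that both sets are scaled identically.

```Python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up numerical feature for illustration
train_num = pd.DataFrame({"goal_usd": [1000.0, 5000.0, 20000.0, 125000.0]})
eval_num = pd.DataFrame({"goal_usd": [2500.0, 9800.0]})

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_num)  # learns mean and std from the training rows
eval_scaled = scaler.transform(eval_num)        # reuses the training mean and std

# The training column now has (approximately) zero mean and unit variance
print(np.round(train_scaled.mean(axis=0), 6), np.round(train_scaled.std(axis=0), 6))
```

Fitting a second scaler on the evaluation data would shift and rescale it differently from the training data, which is why the fitted object is passed along instead.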

**💡 In the following cell, you can use feature engineering to create features that you think might be useful.**

<br>

You may want to put all your processing within a function (such as `processing()`) or may want to do it just as plain Python code. It's entirely up to you!

However, once you have processed `df` and `df_eval`, you must assign them to input variables `X` and `X_eval`.

In [14]:
df.head()

Unnamed: 0,name,blurb,goal,country,deadline,created_at,launched_at,static_usd_rate,category
0,DOCUMENTARY FILM titled FROM RAGS TO SPIRITUAL...,A MOVIE ABOUT THE WILLINGNESS TO BREAK FREE FR...,125000.0,US,1447162860,1444518329,1444673815,1.0,"{""urls"":{""web"":{""discover"":""http://www.kicksta..."
1,"American Politics, Policy, Power and Profit",Everything you should know about really big go...,9800.0,US,1351709344,1348156038,1349117344,1.0,"{""urls"":{""web"":{""discover"":""http://www.kicksta..."
2,"Drew Jacobs Official ""Kiss Me"" Music Video","Be a part of the new ""Kiss Me"" Official Music ...",2500.0,US,1475174031,1473271187,1473359631,1.0,"{""urls"":{""web"":{""discover"":""http://www.kicksta..."
3,Still Loved,When their dreams are shattered by the loss of...,10000.0,GB,1400972400,1395937256,1397218790,1.680079,"{""urls"":{""web"":{""discover"":""http://www.kicksta..."
4,Nine Blackmon's HATER Film Project,HATER is a mock rock doc about why the Rucker ...,5500.0,US,1425963600,1422742820,1423321493,1.0,"{""urls"":{""web"":{""discover"":""http://www.kicksta..."


In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   name             50000 non-null  object 
 1   blurb            50000 non-null  object 
 2   goal             50000 non-null  float64
 3   country          50000 non-null  object 
 4   deadline         50000 non-null  int64  
 5   created_at       50000 non-null  int64  
 6   launched_at      50000 non-null  int64  
 7   static_usd_rate  50000 non-null  float64
 8   category         50000 non-null  object 
dtypes: float64(2), int64(3), object(4)
memory usage: 3.4+ MB


Let's start with the `category` column:

In [16]:
df['category'][0]

'{"urls":{"web":{"discover":"http://www.kickstarter.com/discover/categories/film%20&%20video/movie%20theaters"}},"color":16734574,"parent_id":11,"name":"Movie Theaters","id":298,"position":11,"slug":"film & video/movie theaters"}'

In [17]:
df['category'][1]

'{"urls":{"web":{"discover":"http://www.kickstarter.com/discover/categories/publishing/nonfiction"}},"color":14867664,"parent_id":18,"name":"Nonfiction","id":48,"position":9,"slug":"publishing/nonfiction"}'

In [18]:
df['category'][2]

'{"urls":{"web":{"discover":"http://www.kickstarter.com/discover/categories/music/country%20&%20folk"}},"color":10878931,"parent_id":14,"name":"Country & Folk","id":37,"position":5,"slug":"music/country & folk"}'

In [19]:
# it seems that slug is organised as: general category/precise category
# the parent_id: a class for the general category?
# name/id: info about precise category?

Each entry in the `category` column is a JSON string, so we can parse it and expand its fields into separate columns:

In [20]:
import json

category_pd = pd.json_normalize(df.category.apply(json.loads))
category_pd

Unnamed: 0,color,parent_id,name,id,position,slug,urls.web.discover
0,16734574,11,Movie Theaters,298,11,film & video/movie theaters,http://www.kickstarter.com/discover/categories...
1,14867664,18,Nonfiction,48,9,publishing/nonfiction,http://www.kickstarter.com/discover/categories...
2,10878931,14,Country & Folk,37,5,music/country & folk,http://www.kickstarter.com/discover/categories...
3,16734574,11,Documentary,30,4,film & video/documentary,http://www.kickstarter.com/discover/categories...
4,16734574,11,Narrative Film,31,13,film & video/narrative film,http://www.kickstarter.com/discover/categories...
...,...,...,...,...,...,...,...
49995,10878931,14,Hip-Hop,39,8,music/hip-hop,http://www.kickstarter.com/discover/categories...
49996,14867664,18,Children's Books,46,5,publishing/children's books,http://www.kickstarter.com/discover/categories...
49997,10878931,14,Rock,43,17,music/rock,http://www.kickstarter.com/discover/categories...
49998,16760235,1,Installations,288,5,art/installations,http://www.kickstarter.com/discover/categories...


In [21]:
category_pd['slug'].value_counts()[0:10]

slug
publishing/fiction             1613
film & video/shorts            1584
games/video games              1580
music/rock                     1580
music/indie rock               1565
publishing/children's books    1552
games/tabletop games           1543
film & video/webseries         1541
fashion/apparel                1530
film & video/narrative film    1529
Name: count, dtype: int64

In [22]:
general_cats = category_pd['slug'].apply(lambda x: x.split('/')[0])
precise_cats = category_pd['slug'].apply(lambda x: x.split('/')[1])

In [23]:
df['name'].str.split().str.len()

0         8
1         6
2         7
3         2
4         5
         ..
49995     5
49996     9
49997    11
49998     4
49999     5
Name: name, Length: 50000, dtype: int64

In [24]:
general_cats.value_counts()

slug
music           10012
film & video     9883
publishing       6697
games            4502
art              4469
food             3535
fashion          3217
design           2424
comics           1495
photography      1364
crafts           1112
journalism        796
dance             415
technology         79
Name: count, dtype: int64

In [25]:
import json
import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Your code here:
#
# def processing(df):
#    ...
#
# X = processing(df)
# X_eval = processing(df_eval)


# first function deals with the category column
def _get_categories(x):
    """
    Helper function to get the category.
    It will only keep the most common ones to avoid building too many columns
    Returns: list where [0] entry is the general cat and [1] is the precise cat
    """
    # Only list the most common categories here, everything else will be 'misc'
    generic_cats = ['music', 'film & video', 'publishing', 'games', 'art']
    precise_cats = ['fiction', 'shorts', 'rock', 'video games', 'indie rock', 'webseries', "children's books", 'documentary']

    # format the categories
    categories = json.loads(x).get('slug').split('/')

    # check if it is a common entry
    if categories[0] not in generic_cats:
        categories[0] = 'misc'
    if categories[1] not in precise_cats:
        categories[1] = 'misc'

    return categories


def process_df(dataframe, scaler=None):
    """
    Run all preprocessing steps on `dataframe` (note: it is modified in place).
    Required columns: 'name', 'blurb', 'goal', 'static_usd_rate', 'country',
    'category', 'deadline', 'created_at', 'launched_at'

    If `scaler` is None, a new StandardScaler is fitted on this data (training);
    otherwise the given, already-fitted scaler is reused via transform() (evaluation).
    Returns: (processed dataframe, scaler)
    """
    # string columns: name and blurb
    # 1) let's get the length of the column
    dataframe['len_name'] = dataframe['name'].str.len()
    dataframe['blurb_len'] = dataframe['blurb'].str.len()
    
    # 2) how many words are in the name/blurb
    dataframe['name_word_count'] = dataframe['name'].str.split().str.len()
    dataframe['blurb_word_count'] = dataframe['blurb'].str.split().str.len()
    
    # 3) the average word length of the name
    # note: an empty name gives an empty list, so np.mean returns NaN for that row
    dataframe['avg_word_count'] = dataframe['name'].apply(lambda s: np.mean([len(w) for w in s.split()]))
    # nan_rows = dataframe[dataframe['avg_word_count'].isna()]
    # print(nan_rows)
    
    # 4) drop columns that aren't needed
    dataframe.drop(['name', 'blurb'], axis=1, inplace=True)


    # handle the goal column and convert to USD (same as in the baseline code)
    dataframe['goal'] = dataframe['goal'] * dataframe['static_usd_rate']
    dataframe.drop('static_usd_rate', axis=1, inplace=True)


    # optional: helpful for interpreting the columns
    # create some durations (left in the unix time format)
    dataframe['duration_creation'] = dataframe.launched_at - dataframe.created_at
    dataframe['duration_funding'] = dataframe.deadline - dataframe.launched_at
    dataframe.drop(['launched_at', 'created_at'], axis=1, inplace=True)
    # note: converting to datetime will give you access to days of the week, month, and other useful info


    # handle categorical features: country
    df_cat = pd.DataFrame()

    # deal with the country column first
    countries = {
        "US": "US",
        "CA": "Canada",
        "GB": "UK",
        "AU": "Oceania",
        "IE": "Europe",
        "SE": "Europe",
        "CH": "Europe",
        "IT": "Europe",
        "FR": "Europe",
        "NZ": "Oceania",
        "DE": "Europe",
        "NL": "Europe",
        "NO": "Europe",
        "MX": "South America",
        "ES": "Europe",
        "DK": "Europe",
        "BE": "Europe",
        "AT": "Europe",
        "HK": "Asia",
        "SG": "Asia",
        "LU": "Europe",
    }
    # with the country dict, we can map the values
    df_cat['region'] = dataframe['country'].map(countries)
    
    # extract the month using datetime
    df_cat['month'] = pd.to_datetime(dataframe.deadline, unit='s').dt.month
    
    # use helper function for the categories
    df_cat['generic_cat'] = dataframe.category.apply(lambda x: _get_categories(x)[0])
    df_cat['precise_cat'] = dataframe.category.apply(lambda x: _get_categories(x)[1])

    dataframe.drop(['country', 'category', 'deadline'], axis=1, inplace=True)


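    # One-hot encode our categorical dataframe (or use get_dummies)
    # note: the `sparse` argument was renamed `sparse_output` in scikit-learn 1.2 and
    # removed in 1.4, so we keep the default sparse output and densify with .toarray()
    # caveat: the encoder is re-fitted on every call, so X and X_eval only share
    # the same columns if both frames contain the same categories
    ohe = OneHotEncoder(handle_unknown='ignore')
    df_cat = pd.DataFrame(ohe.fit_transform(df_cat).toarray())
    new_column_names = ohe.get_feature_names_out(input_features=['region', 'month', 'generic_cat', 'precise_cat'])
    df_cat.columns = new_column_names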

    
    # Scale the numerical columns (reuse the fitted scaler if one was passed in)
    if scaler is not None:
        dataframe_processed = pd.DataFrame(scaler.transform(dataframe), columns=dataframe.columns, index=dataframe.index)
    else:
        scaler = StandardScaler()
        dataframe_processed = pd.DataFrame(scaler.fit_transform(dataframe), columns=dataframe.columns, index=dataframe.index)

    # join the categorical columns
    dataframe_processed = dataframe_processed.join(df_cat)

    return dataframe_processed, scaler


X, scaler = process_df(df)
X_eval, scaler = process_df(df_eval, scaler)



In [26]:
df.head()

| | goal | len_name | blurb_len | name_word_count | blurb_word_count | avg_word_count | duration_creation | duration_funding |
|---|---|---|---|---|---|---|---|---|
| 0 | 125000.0 | 53 | 134 | 8 | 24 | 5.75 | 155486 | 2489045 |
| 1 | 9800.0 | 43 | 131 | 6 | 19 | 6.333333 | 961306 | 2592000 |
| 2 | 2500.0 | 42 | 52 | 7 | 11 | 5.142857 | 88444 | 1814400 |
| 3 | 16800.793 | 11 | 120 | 2 | 22 | 5.0 | 1281534 | 3753610 |
| 4 | 5500.0 | 34 | 125 | 5 | 23 | 6.0 | 578673 | 2642107 |


## Part 2: Training the model

Now that we have separated our data into train and evaluation data, we can start training models and evaluating their performance. At this point, you are welcome to explore any model architecture, so long as it is a **classification** model.

Check out the `sklearn` [documentation](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning) for a selection of possible models and implementation examples.

Note that most `sklearn` models share the same interface. Once imported, you can create an instance of your model (specifying custom settings as you see fit) and assign it to a variable. For example:

``` Python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
```

Once you have created your `model` variable, you can call `.fit()` and pass `X` and `y` as arguments.

**💡 For KATE to work, your model must be assigned to a variable called `model`**

**NOTE**: Because your model will be trained directly on KATE in this project, you are limited to models that train in under 1 minute. You will receive a `TimeoutError` if training takes too long.
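
Since every `sklearn` classifier exposes the same `.fit()` and `.score()` methods, candidate models can be swapped in and timed locally before submitting. The sketch below uses synthetic data from `make_classification` and two arbitrarily chosen classifiers; the names `X_demo`, `y_demo`, `candidates` and `fit_seconds` are illustrative assumptions, and local timings will not match KATE's exactly.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the real features and labels.
X_demo, y_demo = make_classification(n_samples=2000, n_features=20, random_state=0)

# Two example classifiers; any other sklearn classifier fits the same loop.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

fit_seconds = {}
for name, candidate in candidates.items():
    start = time.perf_counter()
    candidate.fit(X_demo, y_demo)
    fit_seconds[name] = time.perf_counter() - start
    train_accuracy = candidate.score(X_demo, y_demo)
    print(f"{name}: trained in {fit_seconds[name]:.2f}s, train accuracy {train_accuracy:.3f}")
```

The loop deliberately avoids the name `model`, so it does not overwrite the variable that KATE checks.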


In [27]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 42 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   goal                          50000 non-null  float64
 1   len_name                      50000 non-null  float64
 2   blurb_len                     50000 non-null  float64
 3   name_word_count               50000 non-null  float64
 4   blurb_word_count              50000 non-null  float64
 5   avg_word_count                49998 non-null  float64
 6   duration_creation             50000 non-null  float64
 7   duration_funding              50000 non-null  float64
 8   region_Asia                   50000 non-null  float64
 9   region_Canada                 50000 non-null  float64
 10  region_Europe                 50000 non-null  float64
 11  region_Oceania                50000 non-null  float64
 12  region_South America          50000 non-null  float64
 13  r

In [28]:
X_eval.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 42 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   goal                          10000 non-null  float64
 1   len_name                      10000 non-null  float64
 2   blurb_len                     10000 non-null  float64
 3   name_word_count               10000 non-null  float64
 4   blurb_word_count              10000 non-null  float64
 5   avg_word_count                10000 non-null  float64
 6   duration_creation             10000 non-null  float64
 7   duration_funding              10000 non-null  float64
 8   region_Asia                   10000 non-null  float64
 9   region_Canada                 10000 non-null  float64
 10  region_Europe                 10000 non-null  float64
 11  region_Oceania                10000 non-null  float64
 12  region_South America          10000 non-null  float64
 13  re

In [29]:
# Assign the result back instead of calling fillna(..., inplace=True) on a
# column selection, which pandas warns will stop working in pandas 3.0.
X['avg_word_count'] = X['avg_word_count'].fillna(0)
X_eval['avg_word_count'] = X_eval['avg_word_count'].fillna(0)


In [30]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

model = LogisticRegression(max_iter=10000)
# 'newton-cg' and 'lbfgs' do not support the l1 penalty, so l1 is paired with
# 'liblinear' only; this avoids fits that fail on invalid combinations.
hyperparams = [
    {"solver": ['liblinear'], "penalty": ['l1', 'l2'], "C": [1.5, 3.5, 5.5]},
    {"solver": ['newton-cg', 'lbfgs'], "penalty": ['l2'], "C": [1.5, 3.5, 5.5]},
]

gridsearch = GridSearchCV(model, hyperparams, scoring='accuracy')

gridsearch.fit(X, y)

In [31]:
gridsearch.best_params_

{'C': 5.5, 'penalty': 'l1', 'solver': 'liblinear'}
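
Besides `best_params_`, a fitted `GridSearchCV` also exposes the cross-validated score of the winning combination (`best_score_`) and, with the default `refit=True`, a model refitted on all the data passed to `.fit()` (`best_estimator_`). The sketch below shows these attributes on synthetic data; the grid values and the names `search` and `tuned` are assumptions for illustration, not the notebook's search.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real features and labels.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.1, 1.0, 10.0]},
    scoring="accuracy",
    cv=5,
)
search.fit(X_demo, y_demo)

print(search.best_params_)                    # winning hyperparameters
print(round(search.best_score_, 3))           # mean cross-validated accuracy of the winner
print(search.cv_results_["mean_test_score"])  # one mean score per grid point

# With refit=True (the default), best_estimator_ is already trained on all of X_demo.
tuned = search.best_estimator_
print(tuned.predict(X_demo[:5]))
```

Note that `best_score_` is an average over held-out folds, so it is usually a more honest estimate than the accuracy measured on the training data itself.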

In [32]:
# Your code here:
# model = ...
# X.columns = X.columns.astype(str)
model = LogisticRegression(solver='liblinear', penalty='l1', C=5.5)
model.fit(X, y)

Once trained, we can use the `.score()` method to evaluate our model's performance on the training set; for a classifier it returns the mean accuracy. Remember to pass `X` and `y` as arguments.

In [33]:
# Your code here:
model.score(X, y)

0.65528
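
Accuracy measured on the training set tends to be optimistic, because the model has already seen those rows. One common way to get a less biased estimate without extra labels is cross-validation, compared against a majority-class baseline. The sketch below illustrates the idea on synthetic data; `X_demo`, `y_demo`, `clf` and the choice of five folds are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real features and labels.
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)

clf = LogisticRegression(max_iter=1000)

# Each fold is scored on rows the model did not see while fitting.
cv_scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="accuracy")

# Baseline that always predicts the most frequent class.
baseline_scores = cross_val_score(
    DummyClassifier(strategy="most_frequent"), X_demo, y_demo, cv=5, scoring="accuracy"
)

print(f"model:    {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
print(f"baseline: {baseline_scores.mean():.3f}")
```

A model is only worth submitting if its cross-validated accuracy clearly exceeds the baseline's.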

## Part 3: Making predictions

Now that our model is trained, we can use the `.predict()` method to make predictions for the rows in our data where `y` is not known.
 - Call `.predict()` on the `model` variable, and pass `X_eval`
 - Assign the output of `.predict()` to a variable called `y_pred`

In [34]:
# Your code here:
y_pred = model.predict(X_eval)
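
Before submitting, it can be worth sanity-checking the predictions: one label per row, only the expected classes, and a plausible class balance. The sketch below does this on synthetic data and also shows `.predict_proba()`, which returns class probabilities rather than hard labels; the names `clf`, `labels` and `probabilities` are illustrative and chosen so that they do not overwrite `model` or `y_pred`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: the first 500 rows are "labelled", the last 100 are "unlabelled".
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
X_fit, y_fit, X_new = X_demo[:500], y_demo[:500], X_demo[500:]

clf = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)

labels = clf.predict(X_new)                     # hard class labels (0 or 1)
probabilities = clf.predict_proba(X_new)[:, 1]  # estimated probability of class 1

print(labels.shape)                      # one prediction per row of X_new
print(np.bincount(labels, minlength=2))  # number of predicted 0s and 1s
print(probabilities[:5].round(3))
```

For a binary logistic regression, the hard labels correspond to thresholding the class-1 probability at 0.5.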

Note that in the previous exercise, we used the `.score()` method to evaluate our model on the training data. However, we do not have the ground truth for our `X_eval` data points, so to see how well the model performs on the evaluation set, you will have to submit it to **KATE**!