# Putting it all together

- toc: true 
- badges: true
- comments: true
- author: Lucas Nunes
- categories: [Python, Datacamp, Machine Learning]
- image: images/datacamp/1_supervised_learning_with_scikit_learn/2_regression.png

> Note: This is a summary of the chapter 5 exercises from the DataCamp course "Preprocessing for Machine Learning in Python". <br>[GitHub repo](https://github.com/lnunesAI/Datacamp/) / [Course link](https://www.datacamp.com/tracks/machine-learning-scientist-with-python)

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## UFOs and preprocessing

### Checking column types

<p>Take a look at the UFO dataset's column types using the <code>dtypes</code> attribute. Two columns jump out for transformation: the seconds column, which is a numeric column but is being read in as <code>object</code>, and the <code>date</code> column, which can be transformed into the <code>datetime</code> type. That will make our feature engineering efforts easier later on.</p>

In [39]:
ufo = pd.read_csv('https://raw.githubusercontent.com/lnunesAI/Datacamp/main/2-machine-learning-scientist-with-python/8-preprocessing-for-machine-learning-in-python/datasets/ufo.csv')

Instructions
<ul>
<li>Print out the <code>dtypes</code> of the <code>ufo</code> dataset.</li>
<li>Change the type of the <code>seconds</code> column by passing the <code>float</code> type into the <code>astype()</code> method.</li>
<li>Change the type of the <code>date</code> column by passing <code>ufo["date"]</code> into the <code>pd.to_datetime()</code> function.</li>
<li>Print out the <code>dtypes</code> of the <code>seconds</code> and <code>date</code> columns, to make sure it worked.</li>
</ul>

In [5]:
# Check the column types
print(ufo.dtypes)

# Change the type of seconds to float
ufo['seconds'] = ufo['seconds'].astype(float)

# Change the date column to type datetime
ufo['date'] = pd.to_datetime(ufo['date'])

# Check the column types
print(ufo[['seconds', 'date']].dtypes)

date               object
city               object
state              object
country            object
type               object
seconds           float64
length_of_time     object
desc               object
recorded           object
lat                object
long              float64
dtype: object
seconds           float64
date       datetime64[ns]
dtype: object


**Nice job on transforming the column types! This will make feature engineering and standardization easier.**
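If a numeric column contains stray text, `astype(float)` raises an error. As a minimal sketch on a made-up toy frame (not the real UFO data), `pd.to_numeric()` with `errors="coerce"` is one way to turn unparseable values into `NaN` instead:

```python
import pandas as pd

# Hypothetical toy frame standing in for the UFO data
toy = pd.DataFrame({
    "seconds": ["300", "60.5", "unknown"],
    "date": ["2002-11-21 05:45:00", "2012-06-16 23:00:00", "2013-06-09 00:00:00"],
})

# astype(float) would fail on "unknown"; coercion replaces it with NaN
toy["seconds"] = pd.to_numeric(toy["seconds"], errors="coerce")

# Parse the date strings into a datetime column
toy["date"] = pd.to_datetime(toy["date"])

print(toy.dtypes)
```

The coerced `NaN` values can then be handled together with the other missing data.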

### Dropping missing data


<p>Let's remove some of the rows where certain columns have missing values. We're going to look at the <code>length_of_time</code> column, the <code>state</code> column, and the <code>type</code> column. If any of the values in these columns are missing, we're going to drop the rows.</p>

<ul>
<li>Check how many values are missing in the <code>length_of_time</code>, <code>state</code>, and <code>type</code> columns, using <code>isnull()</code> to check for nulls and <code>sum()</code> to calculate how many exist.</li>
<li>Use boolean indexing to filter out the rows with those missing values, using <code>notnull()</code> to check the column. Here, we can chain together each column we want to check.</li>
<li>Print out the <code>shape</code> of the new <code>ufo_no_missing</code> dataset.</li>
</ul>

In [6]:
# Check how many values are missing in the length_of_time, state, and type columns
print(ufo[['length_of_time', 'state', 'type']].isnull().sum())

# Keep only rows where length_of_time, state, and type are not null
ufo_no_missing = ufo[ufo['length_of_time'].notnull() & 
          ufo['state'].notnull() & 
          ufo['type'].notnull()
          ]

# Print out the shape of the new dataset
print(ufo_no_missing.shape)

length_of_time    143
state             419
type              159
dtype: int64
(4283, 11)


**We'll work with this set going forward.**
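The chained `notnull()` filter can also be written with `dropna()` and its `subset` argument. Here is a small sketch on a made-up frame (the values are invented for illustration) that checks that both approaches keep the same rows:

```python
import pandas as pd

# Hypothetical toy frame with one missing value in each column of interest
toy = pd.DataFrame({
    "length_of_time": ["5 minutes", None, "2 minutes", "10 minutes"],
    "state": ["tx", "ca", None, "ny"],
    "type": ["light", "disk", "oval", None],
})

# Boolean indexing, as in the exercise
mask = toy["length_of_time"].notnull() & toy["state"].notnull() & toy["type"].notnull()
by_mask = toy[mask]

# Equivalent: drop rows with a null in any of the listed columns
by_dropna = toy.dropna(subset=["length_of_time", "state", "type"])

print(by_mask.equals(by_dropna))
print(by_dropna.shape)
```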

## Categorical variables and standardization


### Extracting numbers from strings


<p>The <code>length_of_time</code> field in the UFO dataset is a text field that has the number of minutes within the string. Here, you'll extract that number from that text field using regular expressions.</p>

In [3]:
import re
ufo = pd.read_csv('https://raw.githubusercontent.com/lnunesAI/Datacamp/main/2-machine-learning-scientist-with-python/8-preprocessing-for-machine-learning-in-python/datasets/ufo_1866x11.csv')

Instructions
<ul>
<li>Pass <code>\d+</code> into <code>re.compile()</code> in the <code>pattern</code> variable to designate that we want to grab as many digits as possible from the string.</li>
<li>Into <code>re.match()</code>, pass the <code>pattern</code> we just created, as well as the <code>time_string</code> we want to extract from.</li>
<li>Use <code>lambda</code> within the <code>apply()</code> method to perform the extraction.</li>
<li>Print out the <code>head()</code> of both the <code>length_of_time</code> and <code>minutes</code> columns to compare.</li>
</ul>

In [4]:
def return_minutes(time_string):

    # Use \d+ to grab digits
    pattern = re.compile(r"\d+")

    # Use match on the pattern and column
    num = re.match(pattern, time_string)
    if num is not None:
        return int(num.group(0))
        
# Apply the extraction to the length_of_time column
ufo["minutes"] = ufo["length_of_time"].apply(lambda row: return_minutes(row))

# Take a look at the head of both of the columns
print(ufo[['length_of_time', 'minutes']].head(5))

    length_of_time  minutes
0  about 5 minutes      NaN
1       10 minutes     10.0
2        2 minutes      2.0
3        2 minutes      2.0
4        5 minutes      5.0


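**As you can see, we end up with some NaNs in the DataFrame: `re.match()` only matches at the start of the string, so a value like "about 5 minutes" returns `None`. That's okay for now; we'll take care of those before modeling.**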
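To see where the missing values come from, here is a small sketch comparing `match()` with `search()`. The two helper names are made up for this illustration, and `search()` is not what the course exercise uses; it is shown only as a contrast:

```python
import re

pattern = re.compile(r"\d+")

def minutes_with_match(time_string):
    # match() only succeeds if the digits are at the very start of the string
    found = pattern.match(time_string)
    return int(found.group(0)) if found else None

def minutes_with_search(time_string):
    # search() scans the whole string for the first run of digits
    found = pattern.search(time_string)
    return int(found.group(0)) if found else None

print(minutes_with_match("about 5 minutes"))   # None
print(minutes_with_search("about 5 minutes"))  # 5
print(minutes_with_match("10 minutes"))        # 10
```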

### Identifying features for standardization


<p>In this section, you'll investigate the variance of columns in the UFO dataset to determine which features should be standardized. After taking a look at the variances of the <code>seconds</code> and <code>minutes</code> columns, you'll see that the variance of the <code>seconds</code> column is extremely high. Because <code>seconds</code> and <code>minutes</code> are related to each other (an issue we'll deal with when we select features for modeling), let's log normalize the <code>seconds</code> column.</p>

Instructions
<ul>
<li>Use the <code>var()</code> method on the <code>seconds</code> and <code>minutes</code> columns to check the variance. Notice how high the variance is on the <code>seconds</code> column.</li>
<li>Using <code>np.log()</code> perform log normalization on the <code>seconds</code> column, transforming it into a new column named <code>seconds_log</code>.</li>
<li>Print out the variance of the <code>seconds_log</code> column.</li>
</ul>

In [5]:
# Check the variance of the seconds and minutes columns
print(ufo[['seconds', 'minutes']].var())

# Log normalize the seconds column
ufo["seconds_log"] = np.log(ufo['seconds'])

# Print out the variance of just the seconds_log column
print(ufo["seconds_log"].var())

seconds    424087.417474
minutes       117.546372
dtype: float64
1.122392388118297


**In the next section, we'll focus on engineering new features.**
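To get a feel for why the log transform shrinks the variance so much, here is a sketch on invented durations that span several orders of magnitude. The numbers are not taken from the dataset:

```python
import numpy as np
import pandas as pd

# Made-up durations in seconds, from one second up to one day
seconds = pd.Series([1.0, 10.0, 60.0, 300.0, 3600.0, 86400.0])

# The natural log compresses large values far more than small ones
seconds_log = np.log(seconds)

print(seconds.var())
print(seconds_log.var())

# np.log(0) is -inf; if zeros can occur, np.log1p (log(1 + x)) is one possible alternative
print(np.log1p(pd.Series([0.0, 9.0])))
```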

## Engineering new features

### Encoding categorical variables


<p>There are a couple of columns in the UFO dataset that need to be encoded before they can be modeled with scikit-learn. You'll do that transformation here, using both binary and one-hot encoding methods.</p>

Instructions
<ul>
<li>Using <code>apply()</code>, write a <code>lambda</code> that returns a 1 if the value is <code>us</code>, else return 0. This is something we learned in Chapter 3 if you need a refresher.</li>
<li>Next, print out the number of <code>unique()</code> values of the <code>type</code> column.</li>
<li>Using <code>pd.get_dummies()</code>, create a one-hot encoded set of the <code>type</code> column.</li>
<li>Finally, use <code>pd.concat()</code> to concatenate the <code>ufo</code> dataset to the <code>type_set</code> encoded variables.</li>
</ul>

In [6]:
# Use Pandas to encode us values as 1 and others as 0
ufo["country_enc"] = ufo["country"].apply(lambda value: 1 if value == 'us' else 0)

# Print the number of unique type values
print(len(ufo["type"].unique()))

# Create a one-hot encoded set of the type values
type_set = pd.get_dummies(ufo['type'])

# Concatenate this set back to the ufo DataFrame
ufo = pd.concat([ufo, type_set], axis=1)

21


**Let's continue on by extracting some date parts.**
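Both encodings are easier to inspect on a tiny frame. This is a sketch with invented rows; the binary step uses a vectorized comparison, which is one alternative to the `apply()`/`lambda` form used in the exercise:

```python
import pandas as pd

# Hypothetical toy frame with the two columns we want to encode
toy = pd.DataFrame({
    "country": ["us", "ca", "us"],
    "type": ["light", "disk", "light"],
})

# Binary encoding: 1 for "us", 0 for everything else
toy["country_enc"] = (toy["country"] == "us").astype(int)

# One-hot encoding: one indicator column per distinct type value
type_set = pd.get_dummies(toy["type"])

# Attach the indicator columns to the original frame
toy = pd.concat([toy, type_set], axis=1)

print(toy.columns.tolist())
```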

### Features from dates


<p>Another feature engineering task to perform is month and year extraction. Perform this task on the <code>date</code> column of the <code>ufo</code> dataset.</p>

In [7]:
ufo['date'] = pd.to_datetime(ufo['date'])

Instructions
<ul>
<li>Print out the <code>head()</code> of the <code>date</code> column.</li>
<li>Using <code>apply()</code>, <code>lambda</code>, and the <code>.month</code> attribute, extract the month from the <code>date</code> column.</li>
<li>Using <code>apply()</code>, <code>lambda</code>, and the <code>.year</code> attribute, extract the year from the <code>date</code> column.</li>
<li>Take a look at the <code>head()</code> of the <code>date</code>, <code>month</code>, and <code>year</code> columns.</li>
</ul>

In [8]:
# Look at the first 5 rows of the date column
print(ufo['date'].head())

# Extract the month from the date column
ufo["month"] = ufo["date"].apply(lambda date: date.month)

# Extract the year from the date column
ufo["year"] = ufo["date"].apply(lambda date: date.year)

# Take a look at the head of all three columns
print(ufo[["date", "month", "year"]].head())

0   2002-11-21 05:45:00
1   2012-06-16 23:00:00
2   2013-06-09 00:00:00
3   2013-04-26 23:27:00
4   2013-09-13 20:30:00
Name: date, dtype: datetime64[ns]
                 date  month  year
0 2002-11-21 05:45:00     11  2002
1 2012-06-16 23:00:00      6  2012
2 2013-06-09 00:00:00      6  2013
3 2013-04-26 23:27:00      4  2013
4 2013-09-13 20:30:00      9  2013


**Nice job on extracting dates! 'apply' and 'lambda' are extremely useful for extraction tasks.**
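For datetime columns, pandas also offers the `.dt` accessor, which extracts the same parts without a `lambda`. Here is a short sketch on two of the dates shown above, checking that both approaches agree:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2002-11-21 05:45:00", "2012-06-16 23:00:00"]))

# apply + lambda, as in the exercise
month_apply = dates.apply(lambda d: d.month)

# Vectorized alternative using the .dt accessor
month_dt = dates.dt.month
year_dt = dates.dt.year

print(month_dt.tolist(), year_dt.tolist())
```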

### Text vectorization


<p>Let's transform the <code>desc</code> column in the UFO dataset into tf-idf vectors, since there's likely something we can learn from this field.</p>

Instructions
<ul>
<li>Print out the <code>head()</code> of the <code>ufo["desc"]</code> column.</li>
<li>Set <code>vec</code> equal to the <code>TfidfVectorizer()</code> object.</li>
<li>Use <code>vec</code>'s <code>fit_transform()</code> method on the <code>ufo["desc"]</code> column.</li>
<li>Print out the <code>shape</code> of the <code>desc_tfidf</code> vector, to take a look at the number of columns this created. The output is in the shape (rows, columns).</li>
</ul>

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Take a look at the head of the desc field
print(ufo["desc"].head())

# Create the tfidf vectorizer object
vec = TfidfVectorizer()

# Use vec's fit_transform method on the desc field
desc_tfidf = vec.fit_transform(ufo["desc"])

# Look at the number of columns this creates
print(desc_tfidf.shape)

0    It was a large&#44 triangular shaped flying ob...
1    Dancing lights that would fly around and then ...
2    Brilliant orange light or chinese lantern at o...
3    Bright red light moving north to north west fr...
4    North-east moving south-west. First 7 or so li...
Name: desc, dtype: object
(1866, 3422)


**You'll notice that the text vector has a large number of columns. We'll work on selecting the features we want to use for modeling in the next section.**
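The shape of the result is easier to reason about on a tiny corpus. In this sketch the three descriptions are invented. The number of columns equals the size of the vocabulary the vectorizer learned:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up descriptions, standing in for ufo["desc"]
docs = [
    "bright light in the sky",
    "orange light moving north",
    "triangle shaped object",
]

vec_toy = TfidfVectorizer()
tfidf_toy = vec_toy.fit_transform(docs)

# One row per document, one column per distinct token
print(tfidf_toy.shape)
print(sorted(vec_toy.vocabulary_))
```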

## Feature selection and modeling


### Selecting the ideal dataset


<div class=""><p>Let's get rid of some of the unnecessary features. Because we have an encoded country column, <code>country_enc</code>, keep it and drop other columns related to location: <code>city</code>, <code>country</code>, <code>lat</code>, <code>long</code>, <code>state</code>. </p>
<p>We have columns related to <code>month</code> and <code>year</code>, so we don't need the <code>date</code> or <code>recorded</code> columns. </p>
<p>We vectorized <code>desc</code>, so we don't need it anymore. For now we'll keep <code>type</code>. </p>
<p>We'll keep <code>seconds_log</code> and drop <code>seconds</code> and <code>minutes</code>. </p>
<p>Let's also get rid of the <code>length_of_time</code> column, which is unnecessary after extracting <code>minutes</code>.</p></div>

In [17]:
def return_weights(vocab, original_vocab, vector, vector_index, top_n):
    zipped = dict(zip(vector[vector_index].indices, vector[vector_index].data))
    zipped_series = pd.Series({vocab[i]:zipped[i] for i in vector[vector_index].indices})
    zipped_index = zipped_series.sort_values(ascending=False)[:top_n].index
    return [original_vocab[i] for i in zipped_index]

def words_to_filter(vocab, original_vocab, vector, top_n):
    filter_list = []
    
    for i in range(0, vector.shape[0]):
        filtered = return_weights(vocab, original_vocab, vector, i, top_n)
        filter_list.extend(filtered)
        
    return set(filter_list)
    
vocab = {v:k for k,v in vec.vocabulary_.items()}

Instructions
<ul>
<li>Use <code>.corr()</code> to run the correlation on <code>seconds</code>, <code>seconds_log</code>, and <code>minutes</code> in the <code>ufo</code> DataFrame.</li>
<li>Make a list of columns to drop, in alphabetical order.</li>
<li>Use <code>drop()</code> to drop the columns.</li>
<li>Use the <code>words_to_filter()</code> function we created previously. Pass in <code>vocab</code>, <code>vec.vocabulary_</code>, <code>desc_tfidf</code>, and let's keep the top <code>4</code> words as the last parameter.</li>
</ul>

In [26]:
# Check the correlation between the seconds, seconds_log, and minutes columns
print(ufo[['seconds', 'seconds_log', 'minutes']].corr())

# Make a list of features to drop
to_drop = ['city', 'country', 'lat', 'long', 'state',
'date', 'recorded', 'desc', 'seconds', 'minutes', 'length_of_time']

# Drop those features
ufo_dropped = ufo.drop(to_drop, axis=1)

# Let's also filter some words out of the text vector we created
filtered_words = words_to_filter(vocab, vec.vocabulary_, desc_tfidf, 4)

              seconds  seconds_log   minutes
seconds      1.000000     0.853371  0.980341
seconds_log  0.853371     1.000000  0.824493
minutes      0.980341     0.824493  1.000000


**You're almost done. In the next exercises, we'll try modeling the UFO data in a couple of different ways.**
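The idea behind `words_to_filter()` is to collect, for every description, the column indices of its highest-weighted terms. The sketch below is a simplified stand-in on an invented corpus, not the course's helper: `top_n_columns` is a hypothetical name, and it reads the indices straight from the sparse rows instead of going through the vocabulary dictionaries.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up descriptions, standing in for ufo["desc"]
docs = [
    "bright light in the sky",
    "orange light moving north",
    "triangle shaped object",
]
tfidf_toy = TfidfVectorizer().fit_transform(docs)

def top_n_columns(vector, top_n):
    """Hypothetical helper: union of each row's top_n highest-weighted column indices."""
    keep = set()
    for i in range(vector.shape[0]):
        row = vector[i]
        # Positions of the largest weights among this row's nonzero entries
        order = np.argsort(row.data)[::-1][:top_n]
        keep.update(int(col) for col in row.indices[order])
    return keep

kept = top_n_columns(tfidf_toy, 2)

# Keep only the selected columns of the text vector
filtered = tfidf_toy[:, sorted(kept)]
print(filtered.shape)
```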

### Modeling the UFO dataset, part 1

<p>In this exercise, we're going to build a k-nearest neighbor model to predict which country the UFO sighting took place in. Our <code>X</code> dataset has the log-normalized seconds column, the one-hot encoded type columns, as well as the month and year when the sighting took place. The <code>y</code> labels are the encoded country column, where 1 is <code>us</code> and 0 is <code>ca</code>.</p>

In [33]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X = ufo_dropped.drop(['type', 'country_enc'], axis=1)
y = ufo_dropped['country_enc']

knn = KNeighborsClassifier()

Instructions
<ul>
<li>Print out the <code>.columns</code> of the <code>X</code> set.</li>
<li>Split up the <code>X</code> and <code>y</code> sets using <code>train_test_split()</code>. Pass the <code>y</code> set to the <code>stratify=</code> parameter, since we have imbalanced classes here.</li>
<li>Use <code>fit()</code> to fit <code>train_X</code> and <code>train_y</code>.</li>
<li>Print out the <code>.score()</code> of the <code>knn</code> model on the <code>test_X</code> and <code>test_y</code> sets.</li>
</ul>

In [37]:
# Take a look at the features in the X set of data
print(X.columns)

# Split the X and y sets using train_test_split, setting stratify=y
train_X, test_X, train_y, test_y = train_test_split(X, y, stratify=y)

# Fit knn to the training sets
knn.fit(train_X, train_y)

# Print the score of knn on the test sets
print(knn.score(test_X, test_y))

Index(['seconds_log', 'changing', 'chevron', 'cigar', 'circle', 'cone',
       'cross', 'cylinder', 'diamond', 'disk', 'egg', 'fireball', 'flash',
       'formation', 'light', 'other', 'oval', 'rectangle', 'sphere',
       'teardrop', 'triangle', 'unknown', 'month', 'year'],
      dtype='object')
0.8779443254817987


**This model performs pretty well. It seems like we've made pretty good feature selection choices here.**
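To see what `stratify=` does, here is a sketch on invented labels with an 80/20 class imbalance. The `random_state` value is an arbitrary choice added for reproducibility; the split in the exercise is unseeded, so its score will vary from run to run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 80 ones and 20 zeros
y_toy = np.array([1] * 80 + [0] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# stratify keeps the class proportions similar in both splits
train_X_toy, test_X_toy, train_y_toy, test_y_toy = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=0
)

# Share of the positive class in each split
print(train_y_toy.mean(), test_y_toy.mean())
```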

### Modeling the UFO dataset, part 2


<p>Finally, let's build a model using the text vector we created, <code>desc_tfidf</code>, using the <code>filtered_words</code> list to create a filtered text vector. Let's see if we can predict the <code>type</code> of the sighting based on the text. We'll use a Naive Bayes model for this.</p>

In [42]:
from sklearn.naive_bayes import GaussianNB
y = ufo_dropped['type']
nb = GaussianNB()

Instructions
<ul>
<li>On the <code>desc_tfidf</code> vector, filter by passing a list of <code>filtered_words</code> into the index.</li>
<li>Split up the <code>X</code> and <code>y</code> sets using <code>train_test_split()</code>. Remember to convert <code>filtered_text</code> using <code>toarray()</code>. Pass the <code>y</code> set to the <code>stratify=</code> parameter, since we have imbalanced classes here.</li>
<li>Use the <code>nb</code> model's <code>fit()</code> to fit <code>train_X</code> and <code>train_y</code>.</li>
<li>Print out the <code>.score()</code> of the <code>nb</code> model on the <code>test_X</code> and <code>test_y</code> sets.</li>
</ul>

In [43]:
# Use the list of filtered words we created to filter the text vector
filtered_text = desc_tfidf[:, list(filtered_words)]

# Split the X and y sets using train_test_split, setting stratify=y 
train_X, test_X, train_y, test_y = train_test_split(filtered_text.toarray(), y, stratify=y)

# Fit nb to the training sets
nb.fit(train_X, train_y)

# Print the score of nb on the test sets
print(nb.score(test_X, test_y))

0.15845824411134904

**As you can see, this model performs very poorly on this text data. This is a clear case where iteration would be necessary to figure out what subset of text improves the model, and if perhaps any of the other features are useful in predicting type.**
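The `toarray()` call matters because `GaussianNB` expects a dense array rather than a sparse matrix. Here is a minimal sketch of the select-then-densify step on an invented sparse matrix; the values, labels, and kept columns are all made up:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.naive_bayes import GaussianNB

# Made-up sparse features: 4 documents, 3 terms
sparse_X = csr_matrix(np.array([
    [0.9, 0.0, 0.1],
    [0.8, 0.0, 0.0],
    [0.0, 0.7, 0.2],
    [0.0, 0.9, 0.0],
]))
labels = np.array(["light", "light", "disk", "disk"])

# Select a subset of columns, then convert to a dense NumPy array
keep_cols = [0, 1]
dense_X = sparse_X[:, keep_cols].toarray()

model = GaussianNB().fit(dense_X, labels)
print(model.predict(dense_X))
```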