# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources in the README.md file
- Happy learning!

In [4]:
#Import your libraries

import numpy as np
import pandas as pd

# Challenge 1 - Import and Describe the Dataset

#### In this challenge we will use the `austin_weather.csv` file. 

#### First import it into a data frame called `austin`.

In [5]:
# Your code here

austin = pd.read_csv('../austin_weather.csv')

#### Next, describe the dataset you have loaded: 
- Look at the variables and their types
- Examine the descriptive statistics of the numeric variables 
- Look at the first five rows of all variables to evaluate the categorical variables as well

In [16]:
# Your code here

austin.dtypes

Date                          object
TempHighF                      int64
TempAvgF                       int64
TempLowF                       int64
DewPointHighF                 object
DewPointAvgF                  object
DewPointLowF                  object
HumidityHighPercent           object
HumidityAvgPercent            object
HumidityLowPercent            object
SeaLevelPressureHighInches    object
SeaLevelPressureAvgInches     object
SeaLevelPressureLowInches     object
VisibilityHighMiles           object
VisibilityAvgMiles            object
VisibilityLowMiles            object
WindHighMPH                   object
WindAvgMPH                    object
WindGustMPH                   object
PrecipitationSumInches        object
Events                        object
dtype: object

In [13]:
# Your code here

austin.describe()

| | TempHighF | TempAvgF | TempLowF |
|---|---|---|---|
| count | 1319.0 | 1319.0 | 1319.0 |
| mean | 80.862775 | 70.642911 | 59.902957 |
| std | 14.766523 | 14.045904 | 14.190648 |
| min | 32.0 | 29.0 | 19.0 |
| 25% | 72.0 | 62.0 | 49.0 |
| 50% | 83.0 | 73.0 | 63.0 |
| 75% | 92.0 | 83.0 | 73.0 |
| max | 107.0 | 93.0 | 81.0 |


In [14]:
# Your code here

austin.head()

| | Date | TempHighF | TempAvgF | TempLowF | DewPointHighF | DewPointAvgF | DewPointLowF | HumidityHighPercent | HumidityAvgPercent | HumidityLowPercent | ... | SeaLevelPressureAvgInches | SeaLevelPressureLowInches | VisibilityHighMiles | VisibilityAvgMiles | VisibilityLowMiles | WindHighMPH | WindAvgMPH | WindGustMPH | PrecipitationSumInches | Events |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013-12-21 | 74 | 60 | 45 | 67 | 49 | 43 | 93 | 75 | 57 | ... | 29.68 | 29.59 | 10 | 7 | 2 | 20 | 4 | 31 | 0.46 | Rain , Thunderstorm |
| 1 | 2013-12-22 | 56 | 48 | 39 | 43 | 36 | 28 | 93 | 68 | 43 | ... | 30.13 | 29.87 | 10 | 10 | 5 | 16 | 6 | 25 | 0 | |
| 2 | 2013-12-23 | 58 | 45 | 32 | 31 | 27 | 23 | 76 | 52 | 27 | ... | 30.49 | 30.41 | 10 | 10 | 10 | 8 | 3 | 12 | 0 | |
| 3 | 2013-12-24 | 61 | 46 | 31 | 36 | 28 | 21 | 89 | 56 | 22 | ... | 30.45 | 30.3 | 10 | 10 | 7 | 12 | 4 | 20 | 0 | |
| 4 | 2013-12-25 | 58 | 50 | 41 | 44 | 40 | 36 | 86 | 71 | 56 | ... | 30.33 | 30.27 | 10 | 10 | 7 | 10 | 2 | 16 | T | |


#### Given the information you have learned from examining the dataset, write down three insights about the data in a markdown cell below

#### Your Insights:

1. There are 21 variables in the dataset. Three of them are numeric (`int64`) and the remaining 18 are of object type.

2. The daily average temperature in Austin (`TempAvgF`) ranged from 29 degrees F to 93 degrees F, with a mean of around 70 degrees F. The highest temperature observed during this period was 107 degrees F and the lowest was 19 degrees F.

3. When we look at the output of `head()`, we see that many variables contain numeric data even though their columns are of object type. This means we will probably have to do some data cleaning.


#### Let's examine the DewPointAvgF variable by using the unique() function to identify what makes this column of object type

Describe what you find in a markdown cell below the code

In [27]:
# Your code here

austin.DewPointAvgF.unique()

array(['49', '36', '27', '28', '40', '39', '41', '26', '42', '22', '48',
       '32', '8', '11', '45', '55', '61', '37', '47', '25', '23', '20',
       '33', '30', '29', '17', '14', '13', '54', '59', '15', '24', '34',
       '35', '57', '50', '53', '60', '46', '56', '51', '31', '38', '62',
       '43', '63', '64', '67', '66', '58', '70', '68', '65', '69', '71',
       '72', '-', '73', '74', '21', '44', '52', '12', '75', '76', '18'],
      dtype=object)

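#### Your findings:

Every value in the column is stored as a string, and one of the values is `'-'`, which appears to be a placeholder for a missing reading. Because `'-'` cannot be parsed as a number, pandas reads the whole column as object type.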

In [20]:
# The following is a list of columns misrepresented as object columns. Use this list to
# force the columns to numeric using the `pd.to_numeric` function in the
# next cell (make sure to pass the `errors='coerce'` argument to the function):

# Hint: you may use a loop to change one column at a time but it is more efficient to use the apply function

wrong_type_columns = ['DewPointHighF', 'DewPointAvgF', 'DewPointLowF', 'HumidityHighPercent', 
                      'HumidityAvgPercent', 'HumidityLowPercent', 'SeaLevelPressureHighInches', 
                      'SeaLevelPressureAvgInches' ,'SeaLevelPressureLowInches', 'VisibilityHighMiles',
                      'VisibilityAvgMiles', 'VisibilityLowMiles', 'WindHighMPH', 'WindAvgMPH', 
                      'WindGustMPH', 'PrecipitationSumInches']

In [30]:
# Your code here

austin[wrong_type_columns] = austin[wrong_type_columns].apply(pd.to_numeric, errors='coerce')
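To see what `errors='coerce'` does, here is a minimal sketch on a made-up series (the values are illustrative and not taken from the dataset). Strings that parse as numbers are converted, and anything else, such as a `'-'` placeholder, becomes `NaN`.

```python
import pandas as pd

# Toy series (hypothetical values) mixing numeric strings with a '-' placeholder
raw = pd.Series(['49', '36', '-', '27'])

# errors='coerce' turns unparseable entries into NaN instead of raising an error
converted = pd.to_numeric(raw, errors='coerce')

# Recover the original entries that failed to parse
bad_values = raw[converted.isna()].tolist()

print(converted.tolist())
print(bad_values)
```

Checking `bad_values` before overwriting a column is a cheap way to confirm that only placeholders, and not real readings, are being turned into `NaN`.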

In [32]:
#Check that your code worked by running this cell

austin.dtypes

# Challenge 2 - Handle the Missing Data

#### Now that we have fixed the type mismatch, let's address the missing data.

By coercing the columns to numeric, we have created NaNs in the cells containing characters. We should choose a strategy to address this missing data.

The first step is to examine how many rows contain missing data.

We check how much missing data we have by applying the `.isnull()` function to our dataset. To find the rows with missing data, we chain `.any()` to the result and pass the `axis=1` argument. `austin.isnull().any(axis=1)` returns a boolean column containing True if the row has at least one missing value and False otherwise. We then subset our dataframe with this column, which gives us all rows with at least one missing value.

#### In the next cell, print all rows containing at least one missing value. Use subsetting to obtain the correct number of rows.

In [42]:
# Your code here

austin[austin.isnull().any(axis=1)]

| | Date | TempHighF | TempAvgF | TempLowF | DewPointHighF | DewPointAvgF | DewPointLowF | HumidityHighPercent | HumidityAvgPercent | HumidityLowPercent | ... | SeaLevelPressureAvgInches | SeaLevelPressureLowInches | VisibilityHighMiles | VisibilityAvgMiles | VisibilityLowMiles | WindHighMPH | WindAvgMPH | WindGustMPH | PrecipitationSumInches | Events |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 2013-12-25 | 58 | 50 | 41 | 44.0 | 40.0 | 36.0 | 86.0 | 71.0 | 56.0 | ... | 30.33 | 30.27 | 10.0 | 10.0 | 7.0 | 10.0 | 2.0 | 16.0 | NaN | |
| 6 | 2013-12-27 | 60 | 53 | 45 | 41.0 | 39.0 | 37.0 | 83.0 | 65.0 | 47.0 | ... | 30.39 | 30.34 | 10.0 | 9.0 | 7.0 | 7.0 | 1.0 | 11.0 | NaN | |
| 7 | 2013-12-28 | 62 | 51 | 40 | 43.0 | 39.0 | 33.0 | 92.0 | 64.0 | 36.0 | ... | 30.17 | 30.04 | 10.0 | 10.0 | 7.0 | 10.0 | 2.0 | 14.0 | NaN | |
| 42 | 2014-02-01 | 76 | 66 | 55 | 62.0 | 59.0 | 41.0 | 81.0 | 71.0 | 60.0 | ... | 29.81 | 29.75 | 10.0 | 10.0 | 9.0 | 14.0 | 6.0 | 26.0 | NaN | Rain |
| 51 | 2014-02-10 | 60 | 48 | 35 | 49.0 | 36.0 | 30.0 | 82.0 | 74.0 | 66.0 | ... | 30.15 | 30.02 | 10.0 | 8.0 | 4.0 | 15.0 | 9.0 | 23.0 | NaN | Rain |
| 66 | 2014-02-25 | 71 | 62 | 52 | 65.0 | 60.0 | 47.0 | 93.0 | 85.0 | 77.0 | ... | 30.01 | 29.95 | 10.0 | 4.0 | 1.0 | 12.0 | 5.0 | 20.0 | NaN | Rain |
| 95 | 2014-03-26 | 66 | 60 | 54 | 55.0 | 47.0 | 30.0 | 78.0 | 57.0 | 36.0 | ... | 30.08 | 29.95 | 10.0 | 10.0 | 6.0 | 22.0 | 10.0 | 33.0 | NaN | Rain |
| 102 | 2014-04-02 | 82 | 76 | 70 | 69.0 | 67.0 | 63.0 | 97.0 | 78.0 | 58.0 | ... | 29.80 | 29.69 | 10.0 | 9.0 | 4.0 | 16.0 | 9.0 | 30.0 | NaN | Rain |
| 103 | 2014-04-03 | 82 | 77 | 71 | 69.0 | 67.0 | 66.0 | 90.0 | 74.0 | 58.0 | ... | 29.74 | 29.66 | 10.0 | 8.0 | 5.0 | 14.0 | 7.0 | 25.0 | NaN | |
| 104 | 2014-04-04 | 74 | 64 | 54 | 69.0 | 35.0 | 28.0 | 93.0 | 58.0 | 22.0 | ... | 30.03 | 29.82 | 10.0 | 10.0 | 5.0 | 17.0 | 8.0 | 28.0 | NaN | Rain |


#### There are multiple strategies to handle missing data. 

The simplest strategy is to remove all rows containing missing data or all columns containing missing data. This strategy may work in some cases but not others.

We can also fill all missing values with a sentinel value that marks them as missing, such as 0, -1, or 99. This may work for categorical data. For continuous data, some opt to fill all missing values with the column mean. This strategy is not ideal: it leaves the mean unchanged but shrinks the variance of the column, which can make a model look like it fits better than it really does.

The last strategy is to fill the values using some algorithm. In our case, we will use linear interpolation.
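The three strategies can be compared side by side on a small example. This is only a sketch on a made-up column (the values are hypothetical), using `dropna()`, `fillna()`, and `interpolate()` from pandas.

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values) with two missing readings
s = pd.Series([10.0, np.nan, 30.0, np.nan, 50.0])

dropped = s.dropna()              # strategy 1: remove the missing entries
mean_filled = s.fillna(s.mean())  # strategy 2: fill with the column mean
interpolated = s.interpolate()    # strategy 3: linear interpolation

print(len(dropped))
print(mean_filled.tolist())
print(interpolated.tolist())

# Mean filling keeps the mean but lowers the standard deviation
print(s.std(), mean_filled.std())
```

On this toy column, mean filling produces a flat run of identical values in the middle, while interpolation follows the trend between the known neighbours.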

#### Next, count the number of rows in the subsetted dataframe

In [43]:
# Your code here

austin[austin.isnull().any(axis=1)].shape

(136, 21)

#### To evaluate how much of the data is affected, find the total number of rows in the dataset and compute the ratio of rows with missing data to total rows

In [44]:
# Your code here

austin.shape

austin[austin.isnull().any(axis=1)].shape[0]/austin.shape[0]

0.10310841546626232
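The same ratio can also be computed directly from the boolean mask, since pandas treats True as 1 and False as 0. The following is a small sketch on a made-up frame (the column names and values are hypothetical).

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): two of the four rows contain a missing value
df = pd.DataFrame({'a': [1.0, np.nan, 3.0, 4.0],
                   'b': [1.0, 2.0, 3.0, np.nan]})

row_has_missing = df.isnull().any(axis=1)  # one True/False per row
n_missing_rows = row_has_missing.sum()     # counts the True values
ratio = row_has_missing.mean()             # share of rows with a missing value

print(n_missing_rows, ratio)
```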

#### Since about 10% of the rows contain missing data, perhaps we should evaluate which columns have the most missing data and remove those columns. For the remaining columns, we will fill the missing data using linear interpolation.

We can find the number of missing values in each column using the `.isna()` function. We then chain the `.sum()` function to `.isna()` to count the missing values per column.

In [46]:
# Your code here

austin.isna().sum()

Date                            0
TempHighF                       0
TempAvgF                        0
TempLowF                        0
DewPointHighF                   7
DewPointAvgF                    7
DewPointLowF                    7
HumidityHighPercent             2
HumidityAvgPercent              2
HumidityLowPercent              2
SeaLevelPressureHighInches      3
SeaLevelPressureAvgInches       3
SeaLevelPressureLowInches       3
VisibilityHighMiles            12
VisibilityAvgMiles             12
VisibilityLowMiles             12
WindHighMPH                     2
WindAvgMPH                      2
WindGustMPH                     4
PrecipitationSumInches        124
Events                          0
dtype: int64

#### The majority of missing data is in one column. We prefer to remove this column rather than fill its missing values since there are too many of them.

Remove this column from the dataframe using the `.drop()` function. Use the `inplace=True` argument.

In [48]:
# Your code here 

austin.drop(columns=['PrecipitationSumInches'], inplace=True)

#### Next we will perform linear interpolation of the missing data.

This means that we will use a linear algorithm to estimate the missing data. Linear interpolation assumes that there is a straight line between the known points on either side of a gap and that the missing point falls on that line. This is a good enough approximation for weather-related data. Weather-related data is typically a time series, so we do not want to drop rows from our data if possible. It is preferable to estimate the missing values rather than remove the rows. However, if you have data from a single point in time, a better solution may be to remove the rows.

If you would like to read more about linear interpolation, you can do so [here](https://en.wikipedia.org/wiki/Linear_interpolation).

In the following cell, use the `.interpolate()` function on the entire dataframe. Pass the `inplace=True` argument to the function.

In [49]:
# Your code here

austin.interpolate(inplace=True)
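The straight-line estimate can be checked by hand on a small example. This sketch uses a made-up series (hypothetical values) and compares the default linear method of `interpolate()` with the formula y = y0 + (y1 - y0) * (x - x0) / (x1 - x0), where (x0, y0) and (x1, y1) are the known points around the gap.

```python
import numpy as np
import pandas as pd

# Toy series (hypothetical values) with a two-step gap between 2.0 and 8.0
s = pd.Series([2.0, np.nan, np.nan, 8.0])

# The default method='linear' treats the rows as equally spaced
filled = s.interpolate()

# Same result by hand, with known points (x0, y0) = (0, 2.0) and (x1, y1) = (3, 8.0)
by_hand = [2.0 + (8.0 - 2.0) * (x - 0) / (3 - 0) for x in range(4)]

print(filled.tolist())
print(by_hand)
```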

#### Check that your dataframe contains no missing data

In [52]:
# Your code here

austin.isna().sum()

Date                          0
TempHighF                     0
TempAvgF                      0
TempLowF                      0
DewPointHighF                 0
DewPointAvgF                  0
DewPointLowF                  0
HumidityHighPercent           0
HumidityAvgPercent            0
HumidityLowPercent            0
SeaLevelPressureHighInches    0
SeaLevelPressureAvgInches     0
SeaLevelPressureLowInches     0
VisibilityHighMiles           0
VisibilityAvgMiles            0
VisibilityLowMiles            0
WindHighMPH                   0
WindAvgMPH                    0
WindGustMPH                   0
Events                        0
dtype: int64

# Challenge 3 - Processing the Text Column

#### Our dataframe contains one true text column - the Events column. We should evaluate this column to determine how to process it.

Use the `value_counts()` function to evaluate the contents of this column

In [54]:
# Your code here:

austin.Events.value_counts()

                             903
Rain                         192
Rain , Thunderstorm          137
Fog , Rain , Thunderstorm     33
Fog                           21
Thunderstorm                  17
Fog , Rain                    14
Fog , Thunderstorm             1
Rain , Snow                    1
Name: Events, dtype: int64

#### What is the largest number of events in one day? (Events are separated by a comma.)

Enter your answer in the next markdown cell.

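#### Your answer:

Three. The longest combination in the output above is `Fog , Rain , Thunderstorm`, which occurs on 33 days.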




#### We would like to evaluate all events separately, therefore we will create dummy variables for all events.

We can iterate through each event and create a new column and then assign a 1 to that row if the Events column contains the event and zero otherwise. 

Note that we do not have to exclude one of the columns since some rows will have zero events and others will have more than one. This means that we do not have one column that is a guaranteed linear combination of the remaining columns.

First, let's create a True/False column to indicate whether the Events column contains the word Rain. We do this using the `str.contains()` function. We apply this function to the Events column.

Create this column to check if rain is in the Events column in the cell below. The output should be a column containing only True and False values.

In [60]:
#Your code here

austin.Events.str.contains('Rain')

0        True
1       False
2       False
3       False
4       False
5       False
6       False
7       False
8       False
9       False
10      False
11      False
12      False
13      False
14      False
15      False
16      False
17      False
18       True
19      False
20       True
21      False
22       True
23      False
24      False
25      False
26      False
27      False
28      False
29      False
        ...  
1289    False
1290    False
1291    False
1292    False
1293    False
1294     True
1295    False
1296    False
1297    False
1298    False
1299    False
1300    False
1301    False
1302     True
1303    False
1304     True
1305    False
1306    False
1307    False
1308    False
1309    False
1310     True
1311     True
1312    False
1313    False
1314    False
1315    False
1316    False
1317    False
1318    False
Name: Events, Length: 1319, dtype: bool

Using the `np.where()` function, we can set our new column to 0 if the condition above is false and 1 if the condition is true. Do this for all weather events and create a column for each one. The list of weather events is provided below.

In [64]:
event_list = ['Snow', 'Fog', 'Rain', 'Thunderstorm']

# Your code here

# Create one 0/1 column per event
for event in event_list:
    austin[event] = np.where(austin.Events.str.contains(event), 1, 0)
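As an alternative to the loop, pandas can build all of the indicator columns in one call with `Series.str.get_dummies()`. This is only a sketch on a made-up series (hypothetical rows), and it assumes the events are always separated by exactly `' , '` (space, comma, space), as in the `value_counts()` output.

```python
import pandas as pd

# Toy Events column (hypothetical rows) using the same ' , ' separator
events = pd.Series(['Rain , Thunderstorm', '', 'Fog', 'Rain , Snow'])

# One 0/1 column per distinct event; an empty string gives a row of zeros
dummies = events.str.get_dummies(sep=' , ')

print(dummies)
```

The loop with `str.contains()` has the advantage that the list of events is explicit, so a column is created even for an event that never occurs in the data.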

Now drop the original Events column in the cell below, using the `inplace=True` argument. This will remove the Events column from the austin dataframe and leave us with the columns for all 4 events instead.

In [65]:
# Your code here

austin.drop(columns=['Events'], inplace=True)

# Challenge 4 - Sampling and Holdout Sets

#### Now that we have processed the data for machine learning, we will sample the data as well as separate the data into training and test sets.

First, let's sample our data. We do this using the `sample()` function. In the cell below sample 1/2 of the rows in our dataset. Assign this sample to a variable called `austin_sample`

In [70]:
# Your code here:

austin_sample = austin.sample(frac=0.5)
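Because `sample()` draws rows at random, the result changes on every run. If a repeatable sample is needed, the `random_state` argument fixes the seed. A minimal sketch on a made-up frame (the column name, values, and the seed 42 are arbitrary choices for illustration):

```python
import pandas as pd

# Toy frame (hypothetical values) standing in for the weather data
df = pd.DataFrame({'value': range(10)})

# random_state fixes the seed so the same rows are drawn each time
first = df.sample(frac=0.5, random_state=42)
second = df.sample(frac=0.5, random_state=42)

print(len(first))
print(first.equals(second))
```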

#### Typically when performing a machine learning task, we separate the data into a training set and a test set. 

We first train the model using only the training set and check our metrics on the training set. We then apply the model to the test set and check our metrics on the test set as well. If the metrics are significantly better on the training set, then we have likely overfit our model. We will need to revise our model to ensure it will be more applicable to data outside the training set.
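The gap between training and test metrics can be illustrated without any weather data. The sketch below is an assumption-laden toy, not part of the lab: it uses NumPy's `polyfit` on synthetic noisy sine data and fits a polynomial flexible enough to pass through every training point, then compares the mean squared error on both sets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): noisy samples of a sine curve
x_train = np.linspace(0, 1, 10)
x_test = np.linspace(0.05, 0.95, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, x_train.size)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.3, x_test.size)

# A degree-9 polynomial has enough freedom to pass through all 10 training points
coeffs = np.polyfit(x_train, y_train, deg=9)

def mse(x, y):
    """Mean squared error of the fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

train_mse = mse(x_train, y_train)
test_mse = mse(x_test, y_test)

# A much larger error on the test set is the signature of overfitting
print(train_mse, test_mse)
```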

#### In the next cells we will separate the data into a training set and a test set using the `train_test_split()` function in scikit-learn.

In scikit-learn, we first separate the data into predictor and response variables. This is the standard way of passing datasets into a model in scikit-learn. In the next cell, assign the `TempAvgF` column to `y` and the remaining columns to `X`.

You can do this by first creating a list of column names that do not contain `TempAvgF` 

In [73]:
# Your code here:

col_list = [x for x in austin.columns.values if x != 'TempAvgF']
y = austin['TempAvgF']
X = austin[col_list]

In the next cell, import `train_test_split` from `sklearn.model_selection`

In [75]:
#Your code here:

from sklearn.model_selection import train_test_split

Now that we have split the data into predictor and response variables and imported the `train_test_split()` function, split `X` and `y` into `X_train`, `X_test`, `y_train`, and `y_test`. 80% of the data should be in the training set and 20% in the test set. Enter your code in the cell below.

In [76]:
#Your code here:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

#### While this is common practice for most data, when it comes to time series data, we do not want to randomly select rows from our dataset.

This is because many time series algorithms rely on observations being in chronological order with equal time distance between them. In such cases, we typically select the majority of rows, starting from the earliest, as the training data and the last few rows as the test data.

In the following cell, compute the number of rows that is 80% of our data and round it to the nearest integer. Assign this number to `ts_rows`.

In [79]:
# Your code here:

ts_rows = round(austin.shape[0] * 0.8)

Assign the first `ts_rows` rows of `X` to `X_ts_train` and the remaining rows to `X_ts_test`.

In [103]:
# Your code here:

X_ts_train, X_ts_test = X[:ts_rows], X[ts_rows:]

Assign the first `ts_rows` rows of `y` to `y_ts_train` and the remaining rows to `y_ts_test`.

In [104]:
# Your code here:

y_ts_train, y_ts_test = y[:ts_rows], y[ts_rows:]
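A quick way to confirm that a split like this respects time order is to check that every training date comes before every test date. The following is a minimal sketch on a made-up daily series (the `value` column and its contents are hypothetical), and it assumes the rows are already sorted by date.

```python
import pandas as pd

# Toy daily series (hypothetical values), already sorted by date
df = pd.DataFrame({'Date': pd.date_range('2013-12-21', periods=10, freq='D'),
                   'value': range(10)})

n_train = round(df.shape[0] * 0.8)  # first 80% of the rows

# Slice by position so the original order is preserved
train, test = df.iloc[:n_train], df.iloc[n_train:]

print(len(train), len(test))

# Every training date precedes every test date
print(train['Date'].max() < test['Date'].min())
```

If the data were not sorted, calling `sort_values('Date')` before slicing would be needed for the check to hold.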