# Preprocessing for Machine Learning

Skills and best practices for preparing data for modeling.

## What is data preprocessing?

Data preprocessing comes after exploring and cleaning our dataset. Once we understand the dataset's contents, structure, and quality, we typically have an idea of how we want to model it. Preprocessing means getting the data ready for modeling, for example by transforming categorical features into numerical ones, since most machine learning models in Python require numerical input. It is a standard step in most data analysis and modeling workflows.
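
As a small illustration of turning a categorical feature into numerical ones, the sketch below one-hot encodes a column with `pd.get_dummies`. The `toy` DataFrame and its `color` and `size` columns are made up for this example and are not part of any dataset used in this notebook.

```python
import pandas as pd

# Hypothetical toy data: 'color' is a categorical (string) feature
toy = pd.DataFrame({
    "color": ["red", "green", "red", "blue"],
    "size": [1, 2, 3, 4],
})

# One-hot encode 'color': each category becomes its own 0/1 column
encoded = pd.get_dummies(toy, columns=["color"], dtype=int)
print(encoded)
```

One-hot encoding is only one option; which encoding suits a feature depends on the model and on how many categories the feature has.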

## Why preprocess?

The goal of preprocessing is not only to transform our dataset into a form that is suitable for modeling, but also to improve the performance of our models and, in turn, produce more reliable results.

In [1]:
import pandas as pd # type: ignore
import numpy as np # type: ignore
import matplotlib.pyplot as plt # type: ignore
import seaborn as sns # type: ignore

## Download the data

In [3]:
# Dictionary with dataset names as keys and URLs as values
dataset_urls = {
    'hiking': 'https://assets.datacamp.com/production/repositories/1816/datasets/4f26c48451bdbf73db8a58e226cd3d6b45cf7bb5/hiking.json',
    'wine': 'https://assets.datacamp.com/production/repositories/1816/datasets/9bd5350dfdb481e0f94eeef6acf2663452a8ef8b/wine_types.csv',
    'ufo':'https://assets.datacamp.com/production/repositories/1816/datasets/a5ebfe5d2ed194f2668867603b563963af4769e9/ufo_sightings_large.csv',
    'volunteer':'https://assets.datacamp.com/production/repositories/1816/datasets/668b96955d8b252aa8439c7602d516634e3f015e/volunteer_opportunities.csv'
}

def fetch_data_and_create_dataframe(dataset_urls):
    dataframes = {}  # Dictionary to store DataFrames

    for dataset, url in dataset_urls.items():
        try:
            # Determine file format based on the dataset name
            file_format = 'json' if dataset == 'hiking' else 'csv'
            
            # Read the data from the URL into a DataFrame
            if file_format == 'json':
                df = pd.read_json(url)
            else:
                df = pd.read_csv(url)

            dataframes[dataset] = df
            print(f"Successfully fetched {dataset} data.")
        except Exception as e:
            print(f"Error fetching {dataset} data: {str(e)}")

    return dataframes

# Call the function with your dataset URLs
fetched_dataframes = fetch_data_and_create_dataframe(dataset_urls)

# Now 'fetched_dataframes' is a dictionary where keys are dataset names and values are DataFrames
# You can access each DataFrame using, for example, fetched_dataframes['hiking']

Successfully fetched hiking data.
Successfully fetched wine data.
Successfully fetched ufo data.
Successfully fetched volunteer data.


One of the first steps after importing data is to inspect it, which we can do with the `head()` method.

In [None]:
# Display the hiking dataset
hiking = fetched_dataframes['hiking']
display(hiking.head())

## Exploring with pandas

In [None]:
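# Column dtypes and non-null counts (shows which columns have missing values)
# info() prints its report directly and returns None, so no print() is needed
hiking.info()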

In [None]:
# Summary statistics
print(fetched_dataframes['wine'].describe())

## Removing missing data

In [None]:
data = {
    'A': [1.0, 4.0, 7.0, np.nan, 5.0],
    'B': [np.nan, 7.0, np.nan, 7.0, 9.0],
    'C': [2.0, 3.0, np.nan, np.nan, 7.0],
}

df = pd.DataFrame(data)
display(df)


In [None]:
# drop all rows containing missing values
display(df.dropna())

In [None]:
# drop specific rows using index labels (defaults to dropping rows)
display(df.drop([1, 2, 3]))

In [None]:
# drop a specific column, especially if most or all of its values are missing
# axis=1 means we want to drop a column rather than a row
display(df.drop("A", axis=1))

In [None]:
# drop rows where data is missing in a particular column

# first - how many missing values we have in each column
display(df.isna().sum())

In [None]:
# second - pass a list of column labels to the subset argument of dropna
# here, drop those rows where values are missing in column B
display(df.dropna(subset=["B"]))

In [None]:
# thresh is the minimum number of non-missing values a row needs in order to be kept
display(df.dropna(thresh=2))
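
To compare the `dropna` options above side by side, the sketch below rebuilds the same small DataFrame and counts how many rows survive each call. Note that `thresh` is the minimum number of non-missing values a row needs in order to be kept. The `rows_kept` dictionary is a name introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Same toy data as in the cells above
df = pd.DataFrame({
    "A": [1.0, 4.0, 7.0, np.nan, 5.0],
    "B": [np.nan, 7.0, np.nan, 7.0, 9.0],
    "C": [2.0, 3.0, np.nan, np.nan, 7.0],
})

# Count the rows that remain after each variant of dropna
rows_kept = {
    "dropna()": len(df.dropna()),                          # keep only fully populated rows
    "dropna(subset=['B'])": len(df.dropna(subset=["B"])),  # only column B must be present
    "dropna(thresh=2)": len(df.dropna(thresh=2)),          # at least 2 non-missing values per row
}
print(rows_kept)
```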

## Merging datasets

In [None]:
import pandas as pd

# Gas prices dataset
gas_prices_data = {'date': ['2023-01-01', '2023-01-05', '2023-02-15', '2023-03-17', '2023-04-23', '2023-04-24'],
                   'price': [2.00, 3.00, 2.00, 1.00, 3.00, 2.50]}
gas_prices_df = pd.DataFrame(gas_prices_data)

# Shipment history dataset
shipment_history_data = {'date': ['2023-01-01', '2023-01-02', '2023-02-15', '2023-03-01', '2023-04-23', '2023-05-11'],
                          'quantity': [1000, 5000, 500, 200, 1500, 2500]}
shipment_history_df = pd.DataFrame(shipment_history_data)

# Car sales dataset
car_sales_data = {'date': ['2023-03-02', '2023-03-27', '2023-04-28', '2023-05-15', '2023-07-06', '2023-07-23', '2023-08-09', '2023-08-17'],
                  'sales': [5020, 10020, 30102, 200, 1500, 2500, 500, 2150]}
car_sales_df = pd.DataFrame(car_sales_data)

# Merging all three datasets on 'date' with a full outer join
merged_data = pd.merge(gas_prices_df, shipment_history_df, on='date', how='outer')
merged_data = pd.merge(merged_data, car_sales_df, on='date', how='outer')

# Sorting the DataFrame by 'date' to ensure correct filling order
merged_data['date'] = pd.to_datetime(merged_data['date'])
merged_data.sort_values(by='date', inplace=True)

# Forward-fill missing values, then back-fill any that remain at the start of a column
merged_data.ffill(inplace=True)
merged_data.bfill(inplace=True)

# If you want to see the result
print(merged_data)
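
To make the two fill steps in the cell above easier to follow, here is a minimal sketch on a made-up Series (the values in `s` are hypothetical). It shows that `ffill` leaves a leading gap untouched and that a subsequent `bfill` closes it.

```python
import numpy as np
import pandas as pd

# Hypothetical series with gaps at the start, in the middle, and at the end
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

# ffill carries the last seen value forward; the leading NaN has nothing before it, so it stays
forward = s.ffill()

# bfill then fills the remaining leading gap with the next valid value
both = s.ffill().bfill()

print(pd.DataFrame({"original": s, "ffill": forward, "ffill+bfill": both}))
```

Whether carrying a value forward is reasonable depends on the data; for the merged time series above it assumes the last known value is still valid on the dates where nothing was recorded.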


In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Assuming you have already merged and filled the data as mentioned in the previous code

# Plotting the line graph
plt.figure(figsize=(12, 6))

plt.plot(merged_data['date'], merged_data['price'], label='Gas Price', marker='o')
plt.plot(merged_data['date'], merged_data['quantity'], label='Shipment Quantity', marker='o')
plt.plot(merged_data['date'], merged_data['sales'], label='Car Sales', marker='o')

plt.title('Gas Prices, Shipment Quantity, and Car Sales Over Time')
plt.xlabel('Date')
plt.ylabel('Value')
plt.legend()
plt.grid(True)
plt.show()


In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assuming you have already merged and filled the data as mentioned in the previous code

# Set Seaborn style
sns.set(style="darkgrid")

# Plotting the line graph with Seaborn
plt.figure(figsize=(12, 6))

sns.lineplot(x='date', y='price', data=merged_data, label='Gas Price', marker='o')
sns.lineplot(x='date', y='quantity', data=merged_data, label='Shipment Quantity', marker='o')
sns.lineplot(x='date', y='sales', data=merged_data, label='Car Sales', marker='o')

plt.title('Gas Prices, Shipment Quantity, and Car Sales Over Time')
plt.xlabel('Date')
plt.ylabel('Value')
plt.legend()
plt.show()


In [None]:
# Assuming you have already merged and filled the data as mentioned in the previous code

# Set Seaborn style
sns.set(style="darkgrid")

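# Plotting a smoothed line graph with Seaborn
# lineplot's estimator expects an aggregation (e.g. 'mean'), not a smoother such as 'lowess',
# so each series is smoothed here with a centered 3-point rolling mean instead
value_cols = ['price', 'quantity', 'sales']
smoothed = merged_data.copy()
smoothed[value_cols] = merged_data[value_cols].rolling(window=3, min_periods=1, center=True).mean()

plt.figure(figsize=(12, 6))

# estimator=None draws the values as given, with no aggregation or error bands
sns.lineplot(x='date', y='price', data=smoothed, label='Gas Price', marker='o', estimator=None)
sns.lineplot(x='date', y='quantity', data=smoothed, label='Shipment Quantity', marker='o', estimator=None)
sns.lineplot(x='date', y='sales', data=smoothed, label='Car Sales', marker='o', estimator=None)

plt.title('Gas Prices, Shipment Quantity, and Car Sales Over Time (Smoothed, 3-Point Rolling Mean)')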
plt.xlabel('Date')
plt.ylabel('Value')
plt.legend()
plt.show()


In [None]:
# Display the wine dataset
wine = fetched_dataframes['wine']
# info() prints its report directly and returns None, so it is called without display()
wine.info()

In [None]:
display(wine.describe())

In [None]:
volunteer = fetched_dataframes["volunteer"]
# info() prints its report directly and returns None, so it is called without display()
volunteer.info()

In [None]:
# drop the Latitude and Longitude columns 
volunteer_cols = volunteer.drop(["Latitude","Longitude"], axis=1)
volunteer_cols.info()

In [None]:
# Subset volunteer_cols by dropping rows with missing values in the category_desc column
volunteer_subset = volunteer_cols.dropna(subset=["category_desc"])


In [None]:
#  Verify
volunteer_subset.shape

## Working with data types

In [None]:
# The Dtype column of the info() report lists each column's data type
volunteer.info()

`datetime64` is another common data type; it stores date and time data. It unlocks extra functionality for working with time series data, such as datetime indexing, adding timezone information, and selecting a datetime sampling frequency.
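
A minimal sketch of what the `datetime64` type makes possible, using a made-up DataFrame (`events`, with hypothetical `when` and `value` columns) rather than one of the notebook's datasets: converting strings with `pd.to_datetime`, reading components through the `.dt` accessor, selecting by a partial date string, and resampling to a monthly frequency.

```python
import pandas as pd

# Hypothetical toy data: dates stored as plain strings (object dtype)
events = pd.DataFrame({
    "when": ["2023-01-01", "2023-01-15", "2023-02-03"],
    "value": [10, 20, 30],
})

# Convert the string column to a datetime64 column
events["when"] = pd.to_datetime(events["when"])
print(events.dtypes)

# The .dt accessor exposes datetime components such as the month
events["month"] = events["when"].dt.month

# With a DatetimeIndex, rows can be selected by a partial date string
by_date = events.set_index("when")
january = by_date.loc["2023-01"]

# Resample to month-start frequency and sum the values in each month
monthly_total = by_date["value"].resample("MS").sum()
print(monthly_total)
```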