## Daily exercise, part 1

Modify the function `plot_close` so that it instead takes a list of stocks and plots the daily closing price for all the stocks in the list in a single plot.

Note that the function should only plot the closing price of a stock if that stock is actually present in the data (hint: use an `if` statement inside the function).

Test the function using `closing_prices.csv` and execute the following function calls:
```
plot_close(df, ['AAPL'])
plot_close(df, ['AAPL', 'AMZN', 'BABA', 'FB'])
plot_close(df, ['AAPL', 'FAKE1', 'FAKE2', 'FB'])
```
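
The presence check mentioned in the hint can be sketched on a toy DataFrame. The data below is made up for illustration, and `split_tickers` is a hypothetical helper that is not part of the exercise; the sketch only assumes the `Date`, `Stock` and `Close` columns used elsewhere in this notebook.

```python
import pandas as pd

# Made-up data; column names are assumed to match closing_prices.csv
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-02", "2020-01-03"]),
    "Stock": ["AAPL", "AAPL", "FB", "FB"],
    "Close": [75.1, 74.4, 209.8, 208.7],
})


def split_tickers(df, stocks):
    """Hypothetical helper: split requested tickers into present and missing."""
    available = set(df["Stock"])  # tickers that occur in the data
    present = [s for s in stocks if s in available]
    missing = [s for s in stocks if s not in available]
    return present, missing


present, missing = split_tickers(toy, ["AAPL", "FAKE1", "FAKE2", "FB"])
print("plot:", present)
print("skip:", missing)
```

Inside `plot_close`, the same membership test can sit in an `if` statement within the loop, so that only tickers found in the data are plotted.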

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Import data
df_close = pd.read_csv('data/closing_prices.csv')

# Convert to datetime and sort
df_close['Date'] = pd.to_datetime(df_close['Date'])
df_close = df_close.sort_values('Date')

# Extract subset
amazon = df_close[df_close['Stock'] == 'AMZN'].copy()

amazon.head()

In [2]:


def plot_close(df, stocks):

    fig, ax = plt.subplots()

    # tickers that actually occur in the data (computed once, outside the loop)
    available = set(df['Stock'])

    # loop through the list of stocks
    for stock in stocks:
        # only plot stocks that exist in the data
        if stock in available:
            df_subset = df[df['Stock'] == stock]
            # plot closing price
            ax.plot(df_subset['Date'], df_subset['Close'], label=stock)
        else:
            print(f"Stock {stock} is not present in the data and will be skipped.")

    # beautify
    ax.set_xlabel('Date')
    ax.set_ylabel('Closing Price')
    ax.legend()
    ax.set_title('Daily Closing Prices')
    plt.show()


In [None]:
plot_close(df_close, ['AAPL', 'FAKE1', 'FAKE2', 'FB'])


## Daily exercise, part 2

The presence of outliers, i.e. observations with atypical values, can heavily influence our regression results. An important step in statistical analysis is therefore to investigate the presence of potential outliers.

In the simple regression model that we estimated first (i.e., `mpg ~ horsepower`), we saw that the model consistently underpredicted `mpg` at high levels of actual `mpg`. Could this be caused by some car models having atypical or extreme levels of horsepower?

Import `mpg.xlsx` and explore whether the data contain extreme values of `horsepower` and how they affect the simple regression model (`mpg ~ horsepower`).

1. Check for outliers in `horsepower`. Present any descriptive and/or graphical analysis that you see fit.


2. Explore how much dropping a single observation, i.e. car model, from the data affects the estimated coefficient on `horsepower`:

    a. Create a function called `get_beta` that estimates a simple regression model and returns the beta coefficient for the explanatory/independent variable. The function should take three inputs: `df` (the dataset), `dep` (column name of the dependent variable) and `indep` (column name for the independent variable). 
    
    b. Create a `for` loop in which each iteration drops exactly one observation from the data and uses `get_beta` to retrieve the beta coefficient from that model. In the first iteration you should drop the first observation. In the second iteration you should keep the first observation but drop the second. In the third iteration you should keep the first and second observations but drop the third, and so on.
    
    c. Show a histogram of the estimated beta coefficients. What is your verdict? Does it seem that the estimated coefficient on `horsepower` is affected by the presence of outliers?
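
For step 1, one common screening rule flags values that lie more than 1.5 times the interquartile range (IQR) outside the quartiles, which is also the default whisker rule in a matplotlib boxplot. The sketch below uses made-up horsepower values rather than `mpg.xlsx`, and `iqr_outliers` is a hypothetical helper; treat it as one possible way to check for outliers, not as the required method.

```python
import pandas as pd


def iqr_outliers(values, k=1.5):
    """Hypothetical helper: return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Made-up horsepower values with one clearly extreme observation
hp = pd.Series([90, 95, 100, 105, 110, 115, 120, 230])
flagged = iqr_outliers(hp)
print(flagged.tolist())
```

The multiplier `k` is a convention rather than a test, so flagged observations are candidates for closer inspection, not automatically errors.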




In [10]:
import pandas as pd

# load df
df_mpg = pd.read_excel("data/mpg.xlsx")


In [None]:
# Descriptive statistics for horsepower
import matplotlib.pyplot as plt

horsepower_stats = df_mpg['horsepower'].describe()

# Boxplot and histogram for outliers
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.boxplot(df_mpg['horsepower'].dropna(), vert=False)
plt.title("Boxplot of Horsepower")

plt.subplot(1, 2, 2)
plt.hist(df_mpg['horsepower'].dropna(), bins=20)
plt.title("Histogram of Horsepower")
plt.xlabel("Horsepower")
plt.ylabel("Frequency")


plt.grid()
plt.tight_layout()
plt.show()

# display statistics
print(horsepower_stats)

In [None]:
# Explore how much dropping a single observation affects the coefficient on horsepower

import statsmodels.formula.api as smf

# step 2 a) Define the get_beta function
def get_beta(df, dep, indep):
    """Fit the simple OLS model `dep ~ indep` and return the slope on `indep`."""
    formula = f"{dep} ~ {indep}"
    model = smf.ols(formula=formula, data=df).fit()
    return model.params[indep]

# step 2 b) Drop observations one by one and compute beta coefficients
beta_coefficients = []

# clean data by dropping missing values
df_clean = df_mpg.dropna(subset=['mpg', 'horsepower'])

# total observations in the cleaned dataset
n_clean = len(df_clean)

# loop through and drop observations
for i in range(n_clean):

    # drop the i-th observation
    df_temp = df_clean.drop(index=df_clean.index[i])

    # compute beta coefficient for the independent variable
    beta = get_beta(df_temp, 'mpg', 'horsepower')
    beta_coefficients.append(beta)

# step 2 c) Plot histogram of beta coefficients
plt.figure(figsize=(10, 6))
plt.hist(beta_coefficients, bins=20, edgecolor='k', alpha=0.7)
plt.title("Histogram of Beta Coefficients (Effect of Dropping Observations)")
plt.xlabel("Beta Coefficient for Horsepower")
plt.ylabel("Frequency")
plt.grid()
plt.tight_layout()
plt.show()

# summary of beta coefficients
beta_summary = pd.Series(beta_coefficients).describe()
print(beta_summary)

print("\nExplanation: The coefficients are tightly clustered around the mean, indicating that dropping individual\nobservations does not drastically alter the estimated coefficient for horsepower.")
