## Linear regression

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
%matplotlib inline

**Check out the Data**

In [None]:
USAhousing = pd.read_csv('USA_Housing.csv')

In [None]:
USAhousing.head()

In [None]:
USAhousing.info()

In [None]:
USAhousing.describe()

In [None]:
USAhousing.columns

## EDA

Let's create some simple plots to check out the data!

In [None]:
sns.displot(USAhousing['Price'], bins=15)

In [None]:
# numeric_only=True skips text columns such as Address (needed on pandas >= 2.0)
sns.heatmap(USAhousing.corr(numeric_only=True), annot=True)

# If numeric_only is not recognized, upgrade pandas: !pip install --upgrade pandas

## Training a Linear Regression Model

Let's now begin to train our regression model! We first need to split our data into an X array that contains the features to train on, and a y array with the target variable, in this case the Price column. We will toss out the Address column because it only has text info that the linear regression model can't use.

### X and y arrays

In [None]:
X = USAhousing[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
               'Avg. Area Number of Bedrooms', 'Area Population']]
y = USAhousing['Price']
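
The same split can be written without typing every column name. Below is a minimal sketch on a made-up three-row frame (the values are invented for illustration), assuming every numeric column other than the target is a feature:

```python
import pandas as pd

# Tiny made-up frame with the same kind of columns as the housing data
toy = pd.DataFrame({
    "Avg. Area Income": [60000.0, 75000.0, 52000.0],
    "Area Population": [30000.0, 41000.0, 25000.0],
    "Price": [1.0e6, 1.4e6, 0.9e6],
    "Address": ["1 Main St", "2 Oak Ave", "3 Pine Rd"],
})

# Keep numeric columns only (this drops Address), then separate the target
numeric = toy.select_dtypes(include="number")
X_toy = numeric.drop(columns="Price")
y_toy = toy["Price"]

print(list(X_toy.columns))
```

Listing the columns explicitly, as in the cell above, is safer when some numeric columns should not be used as features.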

## Train Test Split

Now let's split the data into a training set and a testing set. We will train our model on the training set and then use the test set to evaluate the model.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=101)
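
To see what `test_size=0.4` does, here is a small sketch on ten made-up rows: 40% of the rows go to the test set and the rest to the training set, and fixing `random_state` makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=101)

# The same arguments and seed give the same split again
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=101)

print(X_tr.shape, X_te.shape)
```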

## Creating and Training the Model

In [None]:
from sklearn.linear_model import LinearRegression


In [None]:
lm = LinearRegression()

In [None]:
lm_result=lm.fit(X_train,y_train)

## Model Evaluation

Let's evaluate the model by checking out its coefficients and how we can interpret them.

In [None]:
lm.coef_

In [None]:
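# one row per feature, with a named column for the fitted coefficient
coeff_df = pd.DataFrame(lm.coef_, index=X.columns, columns=['Coefficient'])
coeff_df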

Interpreting the coefficients:

- Holding all other features fixed, a 1 unit increase in **Avg. Area Income** is associated with a price **increase of \$21.52**.
- Holding all other features fixed, a 1 unit increase in **Avg. Area House Age** is associated with a price **increase of \$164883.28**.
- Holding all other features fixed, a 1 unit increase in **Avg. Area Number of Rooms** is associated with a price **increase of \$122368.67**.
- Holding all other features fixed, a 1 unit increase in **Avg. Area Number of Bedrooms** is associated with a price **increase of \$2233.80**.
- Holding all other features fixed, a 1 unit increase in **Area Population** is associated with a price **increase of \$15.15**.
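
The "holding all other features fixed" reading can be checked numerically. Below is a minimal sketch on noise-free made-up data (the coefficients 3 and 5 and the intercept 2 are invented for illustration): raising one feature by 1 changes the prediction by that feature's coefficient.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 3.0 * X_demo[:, 0] + 5.0 * X_demo[:, 1] + 2.0  # no noise

model = LinearRegression().fit(X_demo, y_demo)

point = np.array([[1.0, 1.0]])
bumped = point + np.array([[1.0, 0.0]])  # +1 on the first feature only
change = model.predict(bumped)[0] - model.predict(point)[0]

print(change, model.coef_[0])
```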

Does this make sense? Probably not, because I made up this data. If you want real data to repeat this sort of analysis, check out the [California housing dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html) (the Boston dataset used in older tutorials, `load_boston`, was removed in scikit-learn 1.2):

    from sklearn.datasets import fetch_california_housing
    housing = fetch_california_housing()
    print(housing.DESCR)
    housing_data = housing.data

## P-value

In [None]:
#!pip install statsmodels

In [None]:
import statsmodels.api as sm
X2 = sm.add_constant(X_train)
est = sm.OLS(y_train, X2)
est2 = est.fit()
print(est2.summary())

# Kurtosis: fatness of the tails (a normal distribution has kurtosis 3; < 3 means thinner tails)
# Durbin-Watson: test for autocorrelation in the residuals
# (< 2: positive autocorrelation; > 2: negative autocorrelation)
# Jarque-Bera: test of whether the residuals are normally distributed
# (the farther from 0, the less likely the residuals are normal)
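
As a rough illustration of the Durbin-Watson statistic reported in the summary, it can be computed from the residuals as the sum of squared successive differences divided by the sum of squared residuals. The residual values below are made up, and `durbin_watson_stat` is a hypothetical helper name.

```python
import numpy as np

def durbin_watson_stat(residuals):
    """Sum of squared successive differences over sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Residuals that flip sign every step: negative autocorrelation, statistic > 2
alternating = durbin_watson_stat([1.0, -1.0, 1.0, -1.0])

# Residuals that drift slowly: positive autocorrelation, statistic < 2
drifting = durbin_watson_stat([1.0, 1.1, 1.2, 1.1])

print(alternating, drifting)
```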

## Predictions from our Model

Let's grab predictions off our test set and see how well it did!

In [None]:
predictions = lm.predict(X_test)

In [None]:
plt.scatter(y_test,predictions)

**Residual Histogram**

In [None]:
sns.displot((y_test-predictions),bins=50);

## Regression Evaluation Metrics


Here are three common evaluation metrics for regression problems:

**Mean Absolute Error** (MAE) is the mean of the absolute value of the errors:

$$\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|$$

**Mean Squared Error** (MSE) is the mean of the squared errors:

$$\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2$$

**Root Mean Squared Error** (RMSE) is the square root of the mean of the squared errors:

$$\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$

Comparing these metrics:

- **MAE** is the easiest to understand, because it's the average error.
- **MSE** is more popular than MAE, because MSE "punishes" larger errors, which tends to be useful in the real world.
- **RMSE** is even more popular than MSE, because RMSE is interpretable in the "y" units.

All of these are **loss functions**, because we want to minimize them.
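
As a quick check of the three formulas, here is a worked example on three made-up values, computed directly with NumPy:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.0, 5.0, 4.0])

errors = y_true - y_pred          # [1, 0, -2]
mae = np.mean(np.abs(errors))     # (1 + 0 + 2) / 3
mse = np.mean(errors ** 2)        # (1 + 0 + 4) / 3
rmse = np.sqrt(mse)               # back in the units of y

print(mae, mse, rmse)
```

The single error of 2 contributes 4 of the 5 squared-error units, which is the "punishing" effect of MSE described above.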

In [None]:
from sklearn import metrics

In [None]:
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))

Let's do some linear regression exercises.

# Decision Trees and Random Forests in Python

This is the code for the lecture video which goes over tree methods in Python. Reference the video lecture for the full explanation of the code!

I also wrote a [blog post](https://medium.com/@josemarcialportilla/enchanted-random-forest-b08d418cb411#.hh7n1co54) explaining the general logic of decision trees and random forests which you can check out. 

## Import Libraries

In [None]:
%pip install xgboost

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Get the Data

In [None]:
df = pd.read_csv('kyphosis.csv')

In [None]:
df.head()

## EDA

We'll just check out a simple pairplot for this small dataset.

In [None]:
sns.pairplot(df,hue='Kyphosis',palette='Set1')

## Train Test Split

Let's split up the data into a training set and a test set!

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = df.drop('Kyphosis',axis=1)
y = df['Kyphosis']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

## Decision Trees

We'll start just by training a single decision tree.

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
dtree = DecisionTreeClassifier()

In [None]:
dtree.fit(X_train,y_train)

## Prediction and Evaluation 

Let's evaluate our decision tree.

In [None]:
predictions = dtree.predict(X_test)

In [None]:
from sklearn.metrics import classification_report,confusion_matrix


In [None]:
print(classification_report(y_test,predictions))

In [None]:
print(confusion_matrix(y_test,predictions))
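
To connect the two reports, here is a small sketch that derives precision, recall and F1 from the four confusion-matrix counts. The labels are made up, with 1 standing for the positive class.

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 1])

# The four cells of the confusion matrix
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))

precision = tp / (tp + fp)  # of the predicted positives, how many are right
recall = tp / (tp + fn)     # of the true positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn, tn, precision, recall, f1)
```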

## Random Forests

Now let's compare the decision tree model to a random forest.

In [None]:
from sklearn.ensemble import RandomForestClassifier
# n_estimators is the number of trees in the forest
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train, y_train)

In [None]:
rfc_pred = rfc.predict(X_test)

In [None]:
print(confusion_matrix(y_test,rfc_pred))

In [None]:
print(classification_report(y_test,rfc_pred))

# XGBoost

In [None]:
import xgboost as xgb


In [None]:
xgb_cl = xgb.XGBClassifier()

y_train=y_train.astype('category').cat.codes
y_test=y_test.astype('category').cat.codes

y_train.head(5)
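
For reference, this is what the `astype('category').cat.codes` conversion does. The sketch assumes the Kyphosis labels are the strings "absent" and "present"; the three values below are made up.

```python
import pandas as pd

labels = pd.Series(["absent", "present", "absent"])
as_category = labels.astype("category")

# Categories are sorted, so "absent" maps to 0 and "present" to 1
print(list(as_category.cat.categories))

codes = as_category.cat.codes
print(codes.tolist())
```

Because the codes follow the sorted category order, the training and test sets get the same mapping as long as both contain both labels.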


In [None]:
xgb_cl.fit(X_train, y_train)


In [None]:
preds = xgb_cl.predict(X_test)

In [None]:
print(confusion_matrix(y_test,preds))


In [None]:
print(classification_report(y_test,preds))

## Hyperparameter Tuning

Let's pick up where we left off yesterday.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from sklearn.model_selection import train_test_split
import xgboost as xgb

df = pd.read_csv("kyphosis.csv")

# define the features and target for the XGBoost classification model
X = df.drop('Kyphosis',axis=1)
y = df['Kyphosis']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)



y_train = y_train.astype('category').cat.codes
y_test = y_test.astype('category').cat.codes

xgb_cl = xgb.XGBClassifier()


In [2]:
xgb_cl

Commonly used parameters:
1. learning_rate - also called eta; it shrinks the contribution of each additional base learner when fitting the residual errors. Smaller values help prevent overfitting.
2. gamma - minimum loss reduction required to make a further partition on a leaf node.
3. alpha (`reg_alpha`) - L1 regularization term on weights.
4. lambda (`reg_lambda`) - L2 regularization term on weights.
5. max_depth - how deep the tree's decision nodes can go. Must be a positive integer.
6. subsample - fraction of the training set that can be used to train each tree. If this value is too low, it may lead to underfitting; if it is too high, it may lead to overfitting.
7. colsample_bytree - fraction of the features that can be used to train each tree. A large value means almost all features can be used to build each decision tree.




In [3]:
param_grid = {
    "max_depth": [3, 4, 5, 7],
    "learning_rate": [0.1, 0.01, 0.05],
    "gamma": [0, 0.25, 1],
    "reg_lambda": [0, 1, 10],
    "scale_pos_weight": [1, 3, 5],  # weight of the positive class (not alpha), for imbalanced data
    "subsample": [0.8],
    "colsample_bytree": [0.5],
}


In [4]:
from sklearn.model_selection import GridSearchCV

# Init classifier
xgb_cl = xgb.XGBClassifier(objective="binary:logistic")

# Init grid search
# n_jobs=-1 uses all available processors in parallel
# cv=3 runs 3-fold cross-validation (the default would be 5-fold)
grid_cv = GridSearchCV(xgb_cl, param_grid, n_jobs=-1, cv=3,
                       scoring="roc_auc")
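
Before fitting, it helps to know how much work the grid implies. A minimal sketch that counts the parameter combinations in the grid above and the resulting number of model fits with 3-fold cross-validation (not counting the final refit on the full training set):

```python
from math import prod

param_grid_demo = {
    "max_depth": [3, 4, 5, 7],
    "learning_rate": [0.1, 0.01, 0.05],
    "gamma": [0, 0.25, 1],
    "reg_lambda": [0, 1, 10],
    "scale_pos_weight": [1, 3, 5],
    "subsample": [0.8],
    "colsample_bytree": [0.5],
}

# Every combination is tried, so the counts multiply
n_combinations = prod(len(values) for values in param_grid_demo.values())
n_fits = n_combinations * 3  # one fit per fold with cv=3

print(n_combinations, n_fits)
```

Adding one more value to any list multiplies the cost, which is why the later searches fix some parameters and vary only a few.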


In [5]:
grid_cv.fit(X_train, y_train)


In [6]:
grid_cv.best_score_

0.8388888888888889

In [7]:
grid_cv.best_params_




{'colsample_bytree': 0.5,
 'gamma': 1,
 'learning_rate': 0.1,
 'max_depth': 3,
 'reg_lambda': 10,
 'scale_pos_weight': 1,
 'subsample': 0.8}

In [None]:
"""
param_grid = {
    "max_depth": [3, 4, 5, 7],
    "learning_rate": [0.1, 0.01, 0.05],
    "gamma": [0, 0.25, 1],
    "reg_lambda": [0, 1, 10],
    "scale_pos_weight": [1, 3, 5],  # weight of the positive class (not alpha)
    "subsample": [0.8],
    "colsample_bytree": [0.5],
}
"""

Most of the best values sit on the edge of their search ranges, so the true optimum may lie outside the grid and we have to keep searching.

In [None]:
# first, fix the values we no longer search over
param_grid["subsample"] = [0.8]
param_grid["colsample_bytree"] = [0.5]
param_grid["learning_rate"] = [0.01]


# new ranges to search in this iteration
param_grid["max_depth"] = [2, 3, 4, 5]
param_grid["scale_pos_weight"] = [3, 5, 7]
param_grid["gamma"] = [0, 0.1, 0.2]
param_grid["reg_lambda"] = [5, 10, 15]

param_grid

In [None]:
grid_cv_2 = GridSearchCV(xgb_cl, param_grid, 
                         cv=3, scoring="roc_auc", n_jobs=-1)


In [None]:
grid_cv_2.fit(X_train, y_train)


In [None]:
grid_cv_2.best_score_

In [None]:
grid_cv_2.best_params_

In [None]:
{'max_depth': [2, 3, 4, 5],
 'learning_rate': [0.01],
 'gamma': [0, 0.1, 0.2],
 'reg_lambda': [5, 10, 15],
 'scale_pos_weight': [3, 5, 7],
 'subsample': [0.8],
 'colsample_bytree': [0.5]}

In [None]:
# first, fix the values we no longer search over
param_grid["subsample"] = [0.8]
param_grid["colsample_bytree"] = [0.5]
param_grid["reg_lambda"] = [10]
param_grid["gamma"] = [0.1]
param_grid["learning_rate"] = [0.01]
param_grid["max_depth"] = [2]


# new range to search in this iteration
param_grid["scale_pos_weight"] = [5, 7, 9, 12]



In [None]:
grid_cv_3 = GridSearchCV(xgb_cl, param_grid, 
                         cv=3, scoring="roc_auc", n_jobs=-1)


In [None]:
grid_cv_3.fit(X_train, y_train)

In [None]:
grid_cv_3.best_score_  

# best_score_ is the mean cross-validated score (here ROC AUC) of the best
# parameter combination on the training data.
# It doesn't say anything about how this model will perform on unseen data
# or how it compares to other types of models.
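
To make the meaning of `best_score_` concrete, here is a small sketch that uses a decision tree on synthetic data instead of the XGBoost model above (a swapped-in, faster stand-in). It shows that `best_score_` is simply the highest mean cross-validation score among the candidates.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [2, 3, 4]},
    cv=3,
    scoring="roc_auc",
)
search.fit(X_demo, y_demo)

# One mean score per candidate; best_score_ is the largest of them
mean_scores = search.cv_results_["mean_test_score"]
print(search.best_score_, mean_scores.max())
```

A score on the held-out test set is still needed to estimate performance on unseen data.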

In [None]:
grid_cv_3.best_params_

In [None]:
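# grid_cv_2 gave the best score, so we use it for the test-set predictions

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

preds = grid_cv_2.predict(X_test)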

In [None]:
print(confusion_matrix(y_test,preds))

In [None]:
print(classification_report(y_test,preds))

Let's do some exercises.

## Capstone Project

In [None]:
import pandas as pd

In [None]:
df_news=pd.read_csv("Combined_News_DJIA.csv")

In [None]:
df_news.head()

In [None]:
type(df_news['Date'].iloc[0])

In [None]:
df_news['Date']=pd.to_datetime(df_news['Date'])

In [None]:
type(df_news['Date'].iloc[0])

In [None]:
# getting DJIA data
# thousands=',' parses numbers written with a thousands separator, such as 1,234.5
df_price=pd.read_csv("DJIA-price.csv", thousands=',')

In [None]:
df_price.head(5)

In [None]:
df_price['Date']=pd.to_datetime(df_price['Date'])

In [None]:
df_price.sort_values(by='Date', inplace=True)
df_price

In [None]:
df_price=df_price.reset_index(drop=True)
df_price

In [None]:
# let's add the daily percentage change of the adjusted close
df_price['percentage']=df_price["Adj Close**"].pct_change()
df_price.head()
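
A tiny made-up example of what `pct_change()` returns: each value is the relative change from the previous row, and the first row has no previous value.

```python
import pandas as pd

prices = pd.Series([100.0, 110.0, 99.0])
returns = prices.pct_change()

# First entry is NaN, then +10% and -10%
print(returns.tolist())
```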

In [None]:
# let's compute a rolling z-score to find dates that are interesting
window = 52  # trading days, a bit over two months
target_column = 'Adj Close**'
roll = df_price[target_column].rolling(window)
df_price['z-score'] = (df_price[target_column] - roll.mean()) / roll.std()
df_price.to_csv("new_djprice.csv", index=False)
df_price
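
The same rolling z-score on five made-up prices with a window of 3 shows how the calculation behaves. Note that the rolling window includes the current row, so a jump also raises the mean and standard deviation it is compared against.

```python
import pandas as pd

prices = pd.Series([1.0, 2.0, 3.0, 4.0, 10.0])
roll = prices.rolling(3)

# (value - rolling mean) / rolling standard deviation
z = (prices - roll.mean()) / roll.std()

# The first window - 1 rows are NaN because the window is not full yet
print(z.tolist())
```

With a window this short the z-score cannot get very large, which is one reason to use a longer window such as the 52 rows above.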

In [None]:
# let's find the days that have significant movement (|z-score| > 2.5)
df_significant_date = df_price[df_price['z-score'].abs() > 2.5]
df_significant_date.to_csv("df_significant_news.csv")
df_significant_date

In [None]:
# what portion of the dates is significant
len(df_significant_date)/len(df_price)

**Clean the data**

In [None]:
# let's print the first 10 top headlines
headlines = df_news['Top1'][0:10]
for x in headlines:
    print(x)

In [None]:
# check whether the rank of a headline (Top1, Top2, ...) means anything
for i in range(3, 4):
    print("day:", i)
    for x in df_news.iloc[i, 2:]:
        print(x)


In [None]:
# join the two dataframes to get a list of potentially important news
df_sig_news = df_significant_date.merge(df_news, how='left', on='Date')
df_sig_news.head()
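
A small made-up example of what the left merge does: every significant date is kept, and dates without matching news get missing values in the news columns.

```python
import pandas as pd

significant = pd.DataFrame({
    "Date": pd.to_datetime(["2010-01-04", "2010-01-06"]),
    "z-score": [2.7, -3.1],
})
news = pd.DataFrame({
    "Date": pd.to_datetime(["2010-01-04", "2010-01-05"]),
    "Top1": ["headline a", "headline b"],
})

# how='left' keeps all rows of the left frame (the significant dates)
merged = significant.merge(news, how="left", on="Date")
print(merged)
```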

Things that we want to test:

1. whether the news is financial information about the US at all
2. whether it is about a sovereign or national crisis that can affect global sentiment (matching against a list of countries that matter)
3. how similar the contents are to each other

The first thing we can do is to identify the sentiment.

## First, can GPT be a good financial market analyst?


In [None]:
# the first thing we do is to see whether the headlines contain anything significant
#!pip install openai
import openai

In [None]:
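# read the API key from a local file (keep this file out of version control)
with open("secret_key.env.txt", "r") as f:
    secret_key = f.read().strip()  # strip the trailing newline
openai.api_key = secret_key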
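
An alternative sketch for loading the key: prefer an environment variable and fall back to the file. The helper name `load_api_key` and the choice of `OPENAI_API_KEY` as the variable name are assumptions for illustration.

```python
import os

def load_api_key(path="secret_key.env.txt", env_var="OPENAI_API_KEY"):
    """Return the key from the environment if set, otherwise from the file."""
    key = os.environ.get(env_var)
    if key:
        return key.strip()
    with open(path, "r") as f:
        # strip() removes the trailing newline most editors add
        return f.read().strip()
```

It could then be used as `openai.api_key = load_api_key()`, which keeps the key out of the notebook itself.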

In [None]:
# let's first look at a toy example
# https://platform.openai.com/docs/api-reference/completions/create?lang=python
# note: openai.Completion is the legacy interface of the openai package (versions before 1.0)
completion = openai.Completion.create(model="text-davinci-003", prompt="Hello world")
completion

In [None]:
response=completion.choices[0].text
print(response)

In [None]:
#by the way that this is the model
#https://medium.com/@iryna230520/first-steps-in-langchain-the-ultimate-guide-for-beginners-part-1-2baf5a4e1b81

In [None]:
## Let's find out what OpenAI thinks is relevant to the financial market

In [None]:
# let's put it into a function so that we can reuse it
def openai_prompt(prompt):
    completion = openai.Completion.create(model="text-davinci-003", 
                                          prompt=prompt)
    response = completion.choices[0].text
    return response



In [None]:
def extract_headline(row_data):
    """
    Format the headline columns of one row as a numbered list:
    1 : xxx
    2 : xxx
    3 : xxx
    """
    # headlines are assumed to start at position 10 (after the price columns and the label)
    headlines = row_data.iloc[10:]
    master_string = [str(i + 1) + " : " + str(h) for i, h in enumerate(headlines)]
    return "\n".join(master_string)
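
To show the output format, here is a self-contained sketch of the same numbering on a made-up row. `number_headlines` is a hypothetical stand-in that takes the headline values directly instead of slicing at a fixed position.

```python
import pandas as pd

def number_headlines(headlines):
    """Join headlines as '1 : ...', '2 : ...', one per line."""
    return "\n".join(f"{i} : {h}" for i, h in enumerate(headlines, start=1))

# Made-up row: two non-headline columns followed by two headlines
row = pd.Series({
    "Date": "2010-01-04",
    "Label": 1,
    "Top1": "first headline",
    "Top2": "second headline",
})

text = number_headlines(row.iloc[2:])
print(text)
```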


In [None]:
start_prompt="based on the below list of headline news, \
is there anything significant \
that can potentially impact the US stock market"

end_prompt="for the headlines that impact the market, \
make a 3-5 word summary of each headline, \
separate each factor by ':'. \
Reply 'NO' if there isn't any significant headline"

In [None]:
keyword_list=[]

for i in range(0, len(df_sig_news)):
    if i%5==0: print(i)
    mid_prompt=extract_headline(row_data=df_sig_news.iloc[i, :])
    entire_prompt=start_prompt+"\n\n"+mid_prompt+"\n\n"+end_prompt
    key_list=openai_prompt(entire_prompt)
    key_list=key_list.strip().split(":")
    print (key_list)
    #key_list=[x.strip() for x in key_list if len(x)>5]
    keyword_list=keyword_list+key_list



In [None]:
# here are some sample outputs
"""
['Attack on US embassy', ' Dead\nSecret treaty', ' Groups demand\nMiss']
['that affect the market\n\nJapan Debt', ' China Attack', ' Bank Bailout']
['Surveillance of Skype messages', ' China', ' NO \nReserve']
['that may affect the market\n\nUS-Pakistan Conflict', ' Kill 20', ' NO']
['.\n\nNO']
"""

In [None]:
#len(keyword_list)
#pd.DataFrame(keyword_list).to_csv("openai_raw_keyword_extraction.csv")

In [None]:
# if you don't want to run the OpenAI calls, you can go ahead and open up the saved keyword list

In [None]:
keyword_list=pd.read_csv("openai_raw_keyword_extraction.csv", index_col=0)

In [None]:
# to take a look at the general keywords,
# pick the first column and call tolist() to convert it into a list
keyword_list=keyword_list.iloc[:, 0].tolist()


In [None]:
print(keyword_list)
len(keyword_list)

In [None]:
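# let's remove the "NO" replies
# match NO as a whole word; a plain substring test would also drop words like "economy"
import re
cleaned_keyword_list = [x for x in keyword_list
                        if not re.search(r"\bno\b", x, re.IGNORECASE)]
print(cleaned_keyword_list)
len(cleaned_keyword_list)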
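
A caveat on filtering the "NO" replies: a substring test such as `"no" in x.lower()` also matches words that merely contain those letters. The sketch below compares it with a whole-word match on made-up keywords; `is_no_reply` is a hypothetical helper name.

```python
import re

def is_no_reply(text):
    """True if the text contains NO as a whole word, in any letter case."""
    return re.search(r"\bno\b", text, flags=re.IGNORECASE) is not None

samples = ["NO", " NO \nReserve", "China Economy", "North Korea"]

# Substring test: also hits "Economy" and "North"
substring_hits = [s for s in samples if "no" in s.lower()]

# Whole-word test: only hits the real NO replies
word_hits = [s for s in samples if is_no_reply(s)]

print(substring_hits)
print(word_hits)
```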

In [None]:
# let's clean up the keywords by keeping only the text after the last \n
cleaned_keyword_list = [x.split("\n")[-1] for x in cleaned_keyword_list]
print(cleaned_keyword_list)
len(cleaned_keyword_list)
#pd.DataFrame(cleaned_keyword_list).to_csv("cleaned_factor_affects_stock.csv")

In [None]:
# because it is very expensive to ask GPT to find significance
# for all the headlines

In [None]:
# let's do some string cleaning:
# split the keywords into separate words and then remove the duplicates

master_single_words = []
for keyword in cleaned_keyword_list:
    list_of_words = keyword.split(" ")
    if len(list_of_words) == 1:  # special case where the words are joined by "-"
        list_of_words = keyword.split("-")
    # keep purely alphabetic words that are longer than 3 characters
    list_of_words = [x.lower() for x in list_of_words if x.isalpha()
                     and len(x) > 3]
    master_single_words = master_single_words + list_of_words

master_single_words = list(set(master_single_words))
print(len(master_single_words))
#pd.DataFrame(master_single_words).to_csv("master_letter.csv")

master_single_words
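
If the goal is to see which words repeat, a counter keeps the frequencies that `set()` throws away. A minimal sketch on made-up keywords follows; `count_words` is a hypothetical helper, and its minimum length of 3 is an assumption chosen so that short words such as "oil" survive (the cell above keeps only words longer than 3 characters).

```python
from collections import Counter

def count_words(keywords, min_len=3):
    """Count lower-cased alphabetic words with at least min_len letters."""
    counts = Counter()
    for keyword in keywords:
        # treat hyphens like spaces so "US-China" splits into two words
        for word in keyword.replace("-", " ").split():
            if word.isalpha() and len(word) >= min_len:
                counts[word.lower()] += 1
    return counts

demo = ["Oil Price Surge", "Oil Embargo", "Bank Bailout", "US-China Trade"]
counts = count_words(demo)

print(counts.most_common(1))
```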

In [None]:
# for future embedding work, these are the words that are important
important_issues=['oil', 'bank','military', 'russia', 'middle-east war', 'war', 'china economy', 'market price', 'nuclear war']