#     **Text Mining**
##     **NOVA IMS**
####     **Group 013:** Carlota Reis - 20211208 | Guilherme Miranda - 20210420  

## **Stock Market Prediction**
## Predicting stock market movement from news text
### This notebook uses the dataset *test.csv*

## **Dataset description**

- **ID**: unique identifier of each line/day
- **Headline**: news headlines ranging from "Headline1" to "Headline25". For each line, you should use these columns’ text (you are not required to use all columns) to predict the “Closing Status”.

## **Packages and Downloads**

In [34]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
import string
from tqdm import tqdm
import re
from sklearn.feature_extraction.text import CountVectorizer  #needed for the Bag of Words step
from sklearn.neighbors import KNeighborsClassifier
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\lotar\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\lotar\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\lotar\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\lotar\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

## **Data Import and Transformation**

In [35]:
#Importing datasets
stocks_df_test = pd.read_csv('test.csv')
stocks_df_train = pd.read_csv('train.csv')

#pd.read_csv already returns DataFrames, so no further conversion is needed

### *Train*

In [36]:
#Working copy of the training data
train_df = stocks_df_train.copy()

#Joining the headline columns of each row into 1 string
#(column positions 2 to 25, i.e. 24 columns; missing values become the string 'nan')
train_df_aux = []
for row in range(len(train_df)):
    train_df_aux.append(' '.join(str(x) for x in train_df.iloc[row,2:26]))

#Adding this column to dataframe
train_df['Sum'] = train_df_aux

### *Test*

In [37]:
#Working copy of the test data
test_df = stocks_df_test.copy()

#Joining the headline columns of each row into 1 string
#(column positions 1 to 24, i.e. 24 columns; missing values become the string 'nan')
test_df_aux = []
for row in range(len(test_df)):
    test_df_aux.append(' '.join(str(x) for x in test_df.iloc[row,1:25]))

#Adding this column to dataframe
test_df['Sum'] = test_df_aux
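
The row-by-row loops above can also be written as a single pandas expression that selects the headline columns by name instead of by position. This is only a sketch: the helper `join_headlines` and the toy dataframe are hypothetical, and it assumes the headline columns all start with the prefix `Headline`. Unlike the loops, it replaces missing values with an empty string rather than the text `nan`.

```python
import pandas as pd


def join_headlines(df: pd.DataFrame, prefix: str = "Headline") -> pd.Series:
    """Join all columns whose name starts with `prefix` into one string per row."""
    cols = [c for c in df.columns if c.startswith(prefix)]
    joined = df[cols].fillna("").astype(str).agg(" ".join, axis=1)
    # Strip the padding left behind by empty (missing) headlines
    return joined.str.strip()


# Toy data standing in for train.csv / test.csv (hypothetical values)
toy = pd.DataFrame({
    "Id": [1, 2],
    "Headline1": ["Markets rally", "Oil falls"],
    "Headline2": ["Banks recover", None],
})
toy["Sum"] = join_headlines(toy)
print(toy["Sum"].tolist())
```

Selecting by name avoids hard-coding the row count and the column positions, which differ between the train and test files.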

## **Data Preprocessing**

In [38]:
stop = set(stopwords.words('english'))
stemmer = SnowballStemmer('english')
lemma = WordNetLemmatizer()

In [39]:
def pre_clean(text_list):
    
    updates = []
    
    for j in tqdm(text_list):
        
        text = j
        
        #LOWERCASE TEXT
        text = text.lower()
        
        #REMOVE IDENTIFIED ISSUES IN THE CORPUS
        #Strip the byte-string prefixes (b" and b') at the start of a headline
        text = re.sub(r'\bb"', "", text)
        text = re.sub(r"\bb'", "", text)
        #Normalise dotted abbreviations; the dots are escaped so they match
        #a literal '.', and \b keeps words such as "used" or "russia" intact
        text = re.sub(r"\bu\.s\.?", "us", text)
        text = re.sub(r"\bu\.k\.?", "uk", text)
        text = re.sub(r"\bu\.n\.?", "un", text)
        

        updates.append(text)
        
    return updates
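
In a regular expression an unescaped `.` matches any character, so abbreviation patterns need escaped dots to avoid altering ordinary words. The sketch below illustrates the difference on a made-up sentence; the `ABBREVIATIONS` mapping and the function `normalise_abbreviations` are hypothetical names used only for illustration.

```python
import re

# Dotted abbreviation -> plain form (illustrative mapping)
ABBREVIATIONS = {"u.s": "us", "u.k": "uk", "u.n": "un"}


def normalise_abbreviations(text: str) -> str:
    """Replace dotted abbreviations, matching the dots literally."""
    for dotted, plain in ABBREVIATIONS.items():
        # \b anchors the match at a word start; the trailing dot is optional
        pattern = r"\b" + re.escape(dotted) + r"\.?"
        text = re.sub(pattern, plain, text)
    return text


sample = "the u.s. and u.k. used russia's u.n vote"
fixed = normalise_abbreviations(sample)
print(fixed)  # the us and uk used russia's un vote

# For contrast: an unescaped dot also rewrites unrelated words
damaged = re.sub("us.", "us", "used russia")
print(damaged)  # usd rusia
```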

In [40]:
def clean(text_list):
    
    updates = []
    
    for j in tqdm(text_list):
        
        text = j
        
      
        #REMOVE NUMERICAL DATA and PUNCTUATION
        text = re.sub("[^a-zA-Z]"," ", text )

        #REMOVE STOPWORDS
        text = " ".join([word for word in text.split() if word not in stop])
                
        #Lemmatize
        text = " ".join(lemma.lemmatize(word) for word in text.split())
        
        #Stemming
        text = " ".join(stemmer.stem(word) for word in text.split())
            
        updates.append(text)
        
    return updates

In [41]:
#Pre-Clean
test_df_clean = pre_clean(test_df['Sum'])
train_df_clean = pre_clean(train_df['Sum'])

#Clean
test_df_clean = clean(test_df_clean)
train_df_clean = clean(train_df_clean)


100%|██████████| 299/299 [00:00<00:00, 27176.13it/s]
100%|██████████| 1690/1690 [00:00<00:00, 31289.86it/s]
100%|██████████| 299/299 [00:00<00:00, 347.22it/s]
100%|██████████| 1690/1690 [00:04<00:00, 346.00it/s]


## **Feature Engineering - Bag of Words (BoW)**

In [42]:
bow = CountVectorizer(binary=True)

#Defining independent variables
X = bow.fit_transform(train_df_clean)

#Defining dependent variable
y = np.array(train_df['Closing Status'])
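
To make the effect of `binary=True` concrete, here is a small sketch on made-up documents (the texts are illustrative, not taken from the dataset). Each column marks whether a word is present, so repeated words still count as 1, and words that were not seen during fitting are ignored when new text is transformed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents (hypothetical), already cleaned and lowercased
train_docs = ["stock market rise rise", "oil price fall"]
test_docs = ["market fall crash"]

bow_demo = CountVectorizer(binary=True)
X_train_demo = bow_demo.fit_transform(train_docs)  # learns the vocabulary
X_test_demo = bow_demo.transform(test_docs)        # reuses that vocabulary

# Vocabulary ordered by column index
vocab = sorted(bow_demo.vocabulary_, key=bow_demo.vocabulary_.get)
print(vocab)                   # ['fall', 'market', 'oil', 'price', 'rise', 'stock']
print(X_train_demo.toarray())  # "rise" appears twice but is encoded as 1
print(X_test_demo.toarray())   # "crash" is unseen, so it has no column
```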


## **Classification Model - KNN**

In [43]:
#Train Classifier
modelknn = KNeighborsClassifier(n_neighbors = 10, metric = 'cosine', weights = 'distance')

#Fitting the model
modelknn.fit(X,y)

KNeighborsClassifier(metric='cosine', n_neighbors=10, weights='distance')
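
The settings `metric='cosine'` and `weights='distance'` mean that neighbours are ranked by cosine distance and that closer neighbours get a larger say in the vote. The pure-Python sketch below illustrates that idea on toy vectors. It is not scikit-learn's implementation: the function names are hypothetical, vectors are assumed to be non-zero, and exact matches are handled by a simple infinite weight.

```python
import math
from collections import defaultdict


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def weighted_knn_predict(X_train, y_train, query, k):
    """Predict a label by an inverse-distance-weighted vote of the k nearest rows."""
    neighbours = sorted(
        (cosine_distance(row, query), label) for row, label in zip(X_train, y_train)
    )[:k]
    votes = defaultdict(float)
    for dist, label in neighbours:
        # Closer neighbours get larger weights; an exact match dominates
        votes[label] += 1.0 / dist if dist > 0 else float("inf")
    return max(votes, key=votes.get)


# Toy binary bag-of-words rows and labels (hypothetical values)
X_toy = [[1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 1, 1]]
y_toy = [1, 1, 0]
query = [0, 1, 1, 1]

prediction = weighted_knn_predict(X_toy, y_toy, query, k=3)
print(prediction)  # 0
```

In this toy case two of the three neighbours have label 1, so an unweighted majority vote would return 1. The single label-0 neighbour is much closer to the query, and distance weighting lets it win.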

In [44]:
#Vectorising the test set with the vocabulary learned from the training set
X_test = bow.transform(test_df_clean)

#Predicting for test set
y_pred = modelknn.predict(X_test)

#Storing the predictions as a new column of the test dataframe
test_df['Pred_KNN_BoW'] = y_pred.tolist()


## **Final predictions**

In [53]:
predictions = test_df[['Id','Pred_KNN_BoW']]

In [58]:
# Export to csv file
predictions.to_csv('Predictions_013.csv', index=False, sep = ';')
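
Because the file is written with `sep=';'`, whoever reads it back has to pass the same separator. A minimal round-trip sketch, using an in-memory buffer and made-up prediction values in place of the real file:

```python
import io

import pandas as pd

# Toy predictions (hypothetical values) with the same column names as above
toy_predictions = pd.DataFrame({"Id": [1, 2], "Pred_KNN_BoW": [0, 1]})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy_predictions.to_csv(buffer, index=False, sep=";")

# Read it back with the matching separator
buffer.seek(0)
restored = pd.read_csv(buffer, sep=";")
print(buffer.getvalue())
```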