<a href="https://colab.research.google.com/github/iambusra/complingproject/blob/main/Data_Frame_with_Sentiment_Values_after_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install pandas
!pip install xlsxwriter
!pip install openpyxl

import pandas as pd
import xlsxwriter
from google.colab import files
import io
import numpy as np




In [None]:
# uploading the lexicon txt file which we will use for sentiment analysis via Files section on the left


In [None]:
# turning the file into a data frame called lexicon

# on_bad_lines="skip" replaces error_bad_lines=False, which was removed in pandas 2.0
lexicon = pd.read_csv("Turkish-tr-NRC-VAD-Lexicon.txt", sep="\t", on_bad_lines="skip")

In [None]:
# uploading news data via Files section on the left

In [None]:
# turning the file into a data frame called news
news = pd.read_excel("data_clean.xlsx")

# removing the first two columns, which do not contain any relevant information
news = news.drop(['Unnamed: 0', 'Haber Başlıkları'], axis = 1)
print(news)

                                               Linkler                                           Metinler
0                                    [Google Haberler]                                                NaN
1    http://www.haberinadresi.com/bursa-da-kadin-su...  [Bursa’da kimliği belirsiz bir kişi, gözüne ke...
2    http://www.pusulagazetesi.com.tr/kadin-surucu-...  Kadın sürücü kaza yaptı. Zonguldak’ta Kapuz Ma...
3    http://www.pusulagazetesi.com.tr/kadin-surucu-...  Kadın sürücü ve annesi yaralandı, kaza anı kam...
4    http://www.pusulagazetesi.com.tr/kaza-yapan-ka...  Zonguldak’ın Ereğli ilçesinde sürücünün direks...
..                                                 ...                                                ...
226  https://www.sozcu.com.tr/2021/gundem/alkollu-k...  İzmir'in Konak ilçesinde, otomobili uygulama n...
227  https://www.sozcu.com.tr/2021/gundem/seyahat-i...  Edinilen bilgilere göre, ilçe dışından gelen v...
228  https://www.sozcu.com.tr/2021/gunun-icind

In [None]:
# Preprocessing function that we wrote in class.
# This time we use Turkish stop words, since our data is in Turkish.

import nltk
import string
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk.download('punkt')

# a separate name keeps the nltk `stopwords` module usable after this cell
turkish_stopwords = set(stopwords.words('turkish'))

def PreprocessNews(text):
  # lowercase, strip ASCII punctuation, tokenize, then drop stop words
  text = text.lower()
  text = text.translate(str.maketrans('', '', string.punctuation))
  tokenized_text = nltk.word_tokenize(text)
  clean_text = [word for word in tokenized_text if word not in turkish_stopwords]
  return clean_text

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
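
The preprocessing above relies on `str.lower()` and `string.punctuation`, both of which are ASCII-oriented. `string.punctuation` does not contain the typographic apostrophe ’ that appears in the news texts (e.g. "Bursa’da"), and Python's `str.lower()` is not locale-aware, so it maps "I" to "i" rather than "ı" and turns "İ" into "i" plus a combining dot. The following is a minimal sketch of how both gaps could be closed. The helper names `turkish_lower` and `strip_punctuation` are hypothetical, and whether the lexicon entries actually need this normalization is an assumption that should be checked against the data.

```python
import string
import unicodedata


def turkish_lower(text: str) -> str:
    """Lowercase text using the Turkish dotted/dotless i mapping."""
    # Handle the two Turkish-specific letters first, then lowercase the rest.
    return text.replace("İ", "i").replace("I", "ı").lower()


def strip_punctuation(text: str) -> str:
    """Remove ASCII punctuation and any character in a Unicode punctuation category."""
    return "".join(
        ch
        for ch in text
        if ch not in string.punctuation
        and not unicodedata.category(ch).startswith("P")
    )


sample = "Bursa’da ILICA yolunda kaza!"
print(strip_punctuation(turkish_lower(sample)))
```

Like the original `translate` call, this deletes punctuation instead of replacing it with a space, so "Bursa’da" becomes "bursada".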


In [None]:
# Cast every row in Metinler to str so that preprocessing does not fail on
# missing values (a NaN cell becomes the string "nan").
news['Metinler'] = news['Metinler'].astype(str)

In [None]:
# preprocessing our news bodies
news['Metinler'] = news['Metinler'].apply(PreprocessNews)

# Let's find the words that are present both in the Lexicon and the news bodies.

In [None]:
# Collect in matching_words every token from the news bodies that has a matching entry in the lexicon.
matching_words = []
for metin in news['Metinler']:
  for word in metin:
    if word in lexicon["Turkish-tr"].tolist():
      matching_words.append(word)

# It took 15 seconds to complete. Let's check the first 50 matching tokens.
matching_words[0:50]
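
Most of the running time of the loop above goes into rebuilding the lexicon list and scanning it for every single token. A possible way to speed this up is to build a set of lexicon words once and test membership against it. The sketch below uses plain Python lists as stand-ins; the function name `find_matching_tokens` and the toy data are hypothetical, and in the notebook the inputs would be `news['Metinler']` and `lexicon["Turkish-tr"]`.

```python
def find_matching_tokens(documents, lexicon_words):
    """Return every token (with repeats) that also occurs in lexicon_words."""
    vocabulary = set(lexicon_words)  # built once; membership tests are O(1) on average
    return [word for doc in documents for word in doc if word in vocabulary]


# Hypothetical toy data standing in for the tokenized news bodies and the lexicon column.
toy_documents = [["kadın", "sürücü", "kaza", "yaptı"], ["kaza", "anı", "kamerada"]]
toy_lexicon_words = ["kaza", "kadın", "mutlu"]

print(find_matching_tokens(toy_documents, toy_lexicon_words))
```

The result keeps the document order and the repeated tokens, so `len()` of it still counts tokens rather than types.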

**How many words in our news bodies will end up having sentiment values by virtue of being listed in the lexicon?**

In [None]:
len(matching_words)

11477

**We have 11477 tokens in news bodies which are listed in the lexicon. Cool!**

**And how many types?**

In [None]:
len(np.unique(matching_words))

1206

**We have 1206 types in the news bodies that appear in the sentiment lexicon, and this is the number we get without any morphological analyzer. We will incorporate an analyzer (such as Büşra Marşan's, or a better one) to increase this number.**
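
The token and type counts can also be read off a single frequency table, which additionally shows which lexicon words dominate the matches. This is a small sketch with hypothetical data; in the notebook, `toy_matches` would be the `matching_words` list.

```python
from collections import Counter

# Hypothetical matches; in the notebook this would be the matching_words list.
toy_matches = ["kaza", "kadın", "kaza", "sürücü", "kaza"]

type_counts = Counter(toy_matches)
n_tokens = sum(type_counts.values())  # every occurrence counts
n_types = len(type_counts)            # each distinct word counts once

print(n_tokens, n_types)
print(type_counts.most_common(1))     # the most frequent matching word
```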

This notebook is an exploratory analysis to check whether our lexicon will be useful. Judging by the number of matching lexical items, the lexicon looks promising, at least for this project. What we still need to do is find the matching lexicon entries for all words in our data (including the morphologically complex ones) and assign them the corresponding sentiment values.

Then we will randomly split our data into separate training and test sets. Since this is supervised learning, we will also label the news as biased or neutral, so that our model can learn to categorize them from the features (sentiment values) in the data frame.

Finally, we will give the classifier we built the unlabeled test data and see whether it can classify the items based on their sentiment values. We will evaluate the classifier on measures such as accuracy and precision using scikit-learn.
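
To make the planned split and evaluation concrete, here is a rough sketch in plain Python. It is not the scikit-learn code we will end up using (scikit-learn offers `train_test_split`, `accuracy_score` and `precision_score` for this); it only spells out what those steps compute. All function names and labels below are hypothetical.

```python
import random


def train_test_split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(y_true, y_pred):
    """Share of predictions that equal the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision(y_true, y_pred, positive="biased"):
    """Share of items predicted as `positive` that really are `positive`."""
    predicted_positive = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not predicted_positive:
        return 0.0
    return sum(t == positive for t in predicted_positive) / len(predicted_positive)


# Hypothetical labels, only to show the two metrics.
y_true = ["biased", "neutral", "biased", "neutral"]
y_pred = ["biased", "biased", "biased", "neutral"]
print(accuracy(y_true, y_pred))
print(precision(y_true, y_pred))

train, test = train_test_split_rows(range(10))
print(len(train), len(test))
```

Accuracy counts every correct prediction, while precision looks only at the items the classifier flagged as biased, which is why the two numbers can differ on the same predictions.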

# Assigning Sentiment Values to Tokens in the News Body

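We need to rerun the preprocessing code from a few cells above here. Note that it should only be applied to the raw text: if `Metinler` already holds token lists, `astype(str)` turns each list into its string representation before it is tokenized again.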

In [None]:
# to make sure that every row in Metinler is of type str for preprocessing
news.Metinler = news.Metinler.astype(str)

# preprocessing our news bodies
news['Metinler'] = news['Metinler'].apply(PreprocessNews)

In [None]:
# create empty lists for saving the sentiment values in the loop below
A = []
D = []
V = []

# Build a word -> (arousal, dominance, valence) lookup once, instead of
# rebuilding the lexicon list for every token. setdefault keeps the first
# entry for a repeated word, which is what list.index() returned before.
vad = {}
for entry, a, d, v in zip(lexicon["Turkish-tr"], lexicon["Arousal"],
                          lexicon["Dominance"], lexicon["Valence"]):
  vad.setdefault(entry, (a, d, v))

# Look at the news bodies one by one (one row at a time).
# For the news body in that row, find the sentiment values of each word
# and add them up before moving on to the news body in the next row.
for metin in news['Metinler']:
  # Start from 0 for every news body so that each row gets its own totals.
  # (The recorded output below comes from a run without this reset,
  # so it shows running totals across rows.)
  arousal = 0
  dominance = 0
  valence = 0
  for word in metin:
    if word in vad:
      a, d, v = vad[word]
      arousal += a
      dominance += d
      valence += v

  A.append([arousal])
  D.append([dominance])
  V.append([valence])

print(A)
print(D)
print(V)

[[0], [31.07000000000001], [43.61100000000001], [62.19499999999999], [80.83200000000004], [81.12400000000004], [81.12400000000004], [147.33999999999995], [147.33999999999995], [200.04599999999994], [337.43499999999943], [337.43499999999943], [337.43499999999943], [337.43499999999943], [348.26899999999944], [348.8569999999994], [349.1849999999994], [349.51299999999935], [362.95099999999945], [404.6509999999993], [433.8579999999994], [438.3909999999994], [438.71899999999937], [487.82799999999907], [495.2869999999992], [519.1249999999991], [542.1589999999992], [562.1459999999992], [581.1219999999988], [599.9579999999992], [640.4429999999994], [656.5659999999995], [670.8629999999996], [726.2989999999995], [757.1999999999995], [780.8579999999997], [789.0939999999999], [801.0899999999999], [828.7130000000001], [844.3470000000003], [932.9670000000015], [991.6560000000023], [1040.3010000000033], [1116.7700000000025], [1280.0039999999988], [1343.8599999999983], [1364.6099999999976], [1387.60699
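
The values printed above never decrease from one row to the next, which is what happens when the totals are carried over between news bodies instead of starting from zero for each one. The sketch below shows per-document totals on toy data. The dictionary `toy_vad`, its values, and the function name `sum_vad` are hypothetical and are not taken from the NRC-VAD lexicon.

```python
# Hypothetical lexicon entries: word -> (arousal, dominance, valence).
toy_vad = {
    "kaza": (0.75, 0.25, 0.125),
    "mutlu": (0.5, 0.75, 1.0),
}

toy_documents = [["kaza", "yaptı", "kaza"], ["mutlu", "sürücü"], ["nan"]]


def sum_vad(tokens, vad):
    """Sum arousal, dominance and valence over the tokens found in vad."""
    arousal = dominance = valence = 0.0  # fresh totals for every document
    for word in tokens:
        if word in vad:
            a, d, v = vad[word]
            arousal += a
            dominance += d
            valence += v
    return arousal, dominance, valence


totals = [sum_vad(doc, toy_vad) for doc in toy_documents]
print(totals)
```

Each tuple depends only on its own document, so a document without any lexicon match (such as the `["nan"]` row) gets zeros regardless of what came before it.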

In [None]:
# Check if the number of rows in the news match the number of items in the lists A, D, and V.
print(len(A))
print(len(D))
print(len(V))

# 231 rows as expected. Great!

231
231
231


In [None]:
# Let's append these lists with sentiment values to news data frame.
news["Arousal"] = A
news["Dominance"] = D
news["Valence"] = V
news.head()

Unnamed: 0,Linkler,Metinler,Arousal,Dominance,Valence
0,[Google Haberler],[nan],[0],[0],[0]
1,http://www.haberinadresi.com/bursa-da-kadin-su...,"[bursa, ’, kimliği, belirsiz, bir, kişi, gözün...",[31.07000000000001],[32.668000000000006],[36.458999999999996]
2,http://www.pusulagazetesi.com.tr/kadin-surucu-...,"[kadın, sürücü, kaza, yaptı, zonguldak, ’, ta,...",[43.61100000000001],[44.11899999999998],[48.931]
3,http://www.pusulagazetesi.com.tr/kadin-surucu-...,"[kadın, sürücü, annesi, yaralandı, kaza, anı, ...",[62.19499999999999],[63.977999999999966],[71.80900000000001]
4,http://www.pusulagazetesi.com.tr/kaza-yapan-ka...,"[zonguldak, ’, ın, ereğli, ilçesinde, sürücünü...",[80.83200000000004],[81.30699999999996],[88.44700000000002]


In [None]:
# saving the data frame with sentiment values
# the context manager writes and closes the file; writer.save() was removed in pandas 2.0
with pd.ExcelWriter('Sentiment_data_clean.xlsx') as writer:
  news.to_excel(writer)