
In this file we will work with the generated CSV files.


In [2]:
import pandas as pd
import nltk
import numpy as np
import spacy
from textblob import TextBlob
from transformers import pipeline
import re
import string
from wordcloud import WordCloud,STOPWORDS
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.pylab as pl
import matplotlib.gridspec as gridspec

We will use pretrained BERT based transformer models to analyse the sentiment in three languages:
English, Turkish and German

The pretrained models can be installed via following commands:

pip install transformers
git lfs install
git clone https://huggingface.co/oliverguhr/german-sentiment-bert
git clone https://huggingface.co/savasy/bert-base-turkish-sentiment-cased

Read documentation here: 

https://huggingface.co/transformers/quicktour.html
https://huggingface.co/savasy/bert-base-turkish-sentiment-cased#
https://huggingface.co/oliverguhr/german-sentiment-bert?text=I+like+you.+I+love+you#


Model for sentiment analysis in turkish language

In [3]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer_tr = AutoTokenizer.from_pretrained("savasy/bert-base-turkish-sentiment-cased")

model_tr = AutoModelForSequenceClassification.from_pretrained("savasy/bert-base-turkish-sentiment-cased")

turkish= pipeline("sentiment-analysis", tokenizer=tokenizer_tr, model=model_tr)

#example

t = turkish("bu telefon modelleri çok kaliteli , her parçası çok özel bence")
print(t)
# [{'label': 'LABEL_1', 'score': 0.9871089}]
# print(p[0]['label'] == 'LABEL_1')
# True
t[0]['label']

[{'label': 'positive', 'score': 0.983146607875824}]


'positive'

Model in english language

In [4]:
english = pipeline('sentiment-analysis')

#example
e=english("Hey, you are beautiful")

e

[{'label': 'POSITIVE', 'score': 0.999866783618927}]

Model in german language

In [5]:

tokenizer_gr = AutoTokenizer.from_pretrained("oliverguhr/german-sentiment-bert")

model_gr = AutoModelForSequenceClassification.from_pretrained("oliverguhr/german-sentiment-bert")

german= pipeline("sentiment-analysis", tokenizer=tokenizer_gr, model=model_gr)

#example

g=german("Ich liebe dich")

g[0]['score']

0.9846150875091553

We will now define functions to get sentiment and scores for turkish english and german languages.
We will apply these functions to our pandas dataframes using "df.apply(function)"" method

Here 'sentiment_tr' and 'sentiment_score_tr' refer to the funtions that returns the sentiment (Positive/Negative) 
and the score (0-1) for a text in turkish language.

Similarly we have defined functions to return the sentiment and scores for text in english and german.


In [6]:


def sentiment_tr(sentence):
    sentiment=turkish(sentence)[0]['label']
    return sentiment

def sentiment_score_tr(sentence):
    score=turkish(sentence)[0]['score']
    return score


def sentiment_gr(sentence):
    sentiment=german(sentence)[0]['label']
    return sentiment

def sentiment_score_gr(sentence):
    score=german(sentence)[0]['score']
    return score


def sentiment_en(sentence):
    sentiment=english(sentence)[0]['label']
    return sentiment

def sentiment_score_en(sentence):
    score=english(sentence)[0]['score']
    return score
    

We will now download the stopwods in English, Turkish and German from spacy

We will add some more stopwords from hindi and create a list of stopwords.

In [14]:
#stopwords. We chose stopwords from spacy because it has more stopwords than nltk and also has turkish stopwords

from spacy.lang.en.stop_words import STOP_WORDS as stopwords_english
from spacy.lang.de.stop_words import STOP_WORDS as stopwords_deutsch
from spacy.lang.tr.stop_words import STOP_WORDS as stopwords_turkish

with open('data/hindi_stopwords.txt',encoding='utf-8') as file:  
    line=file.read()
    

stopwords_hindi=line.split()

more_stopwords=['erdoÄŸan','merkel','modi','bjp','cdp','akp','hÃ¼kÃ¼met','gsyä°h','government','govt','regierung','wirtschaft',
                'corona','covid','economy','ekonomi','keyword','erdoÄŸan','merkel','modi','bjp','cdp','akp','hÃ¼kÃ¼met','gsyä°h',
                'government','govt','regierung','wirtschaft','ekonomi','keyword','जी','है।','है,','ji','shri','ji.','जी','हैं।','pm',
                'india’s','day','जी।','बहुत-बहुत','हर','लोगों','rt']

stop_words=list(stopwords_english)+list(stopwords_deutsch)+list(stopwords_turkish)+stopwords_hindi+more_stopwords

We will now load the csv which contains tweets on various keywords from 12 locations in 3 countries.

In [15]:


df=pd.read_csv('.\data\dataset.csv',encoding='utf-8')

In [16]:
df.sample(3)

Unnamed: 0,Tweets,Timestamp,Location,Keyword,Category,Country
20298,@nan59232397\n@Pervinsenem1\n \nAkp çıkana kad...,2/1/2021 11:58,Izmir,AKP,Government,Turkey
22757,Bu ülkede normal olan herhangi birşey var mı??...,3/1/2021 10:24,Antalya,ekonomi,Economy,Turkey
27336,My mum (90 in March) got her first COVID vacci...,1/19/2021 18:49,Cologne,vaccine,Vaccine,Germany


In [17]:
df.dtypes

Tweets       object
Timestamp    object
Location     object
Keyword      object
Category     object
Country      object
dtype: object

We will now tokenize the words in tweets using nltk.tokenize() method

Only words that are not in stopwords listed defined earlier will be included.

Also, the words will be converted to lower case

In [18]:

tokenize =lambda x: [word.lower() for word in x.split() if word.lower() not in stop_words and 
                     word.startswith('@')==False and word.startswith('https')==False and word.isdigit()==False] 


df['Word_Tokens']=df['Tweets'].apply(tokenize)

In [19]:
df.sample(5)

Unnamed: 0,Tweets,Timestamp,Location,Keyword,Category,Country,Word_Tokens
17899,இளைஞர்கள் அரசியலில் நுழைய வேண்டும் - பிரதமர் ம...,12/1/2021 9:29,Chennai,BJP,Government,India,"[இளைஞர்கள், அரசியலில், நுழைய, வேண்டும், -, பிர..."
130,Die Antwort auf die Frage nach dem #homeoffice...,5/1/2021 18:25,Berlin,Merkel,Angela Merkel,Germany,"[antwort, frage, #homeoffice, zeigt, schön, po..."
13384,https://t.co/aOA5wys1iw https://t.co/uAhAtyCa5O,30-12-2020 16:00,Chennai,economy,Economy,India,[]
27934,In Berlin stehen in unterschiedlichen Impfzent...,1/18/2021 17:00,Hamburg,Pfizer-Biontech,Vaccine,Germany,"[berlin, stehen, unterschiedlichen, impfzentre..."
12669,इकॉनमी के लिए खुशखबरी ला रहे हैं ये आंकड़े..\n...,4/1/2021 4:50,Delhi,economy,Economy,India,"[इकॉनमी, खुशखबरी, ला, आंकड़े..]"


We will now define another function to clean the tweets

In [20]:
#define function to remove noise

def clean_text(text_list):
    for text in text_list:
        text = re.sub(r'@[a-zA-Z0-9_]+','',text)
        text = re.sub(r'#','',text)
        text = re.sub(r'rt[\s]+','',text)
        text = re.sub(r'https?:\/\/\s+','',text)
        text = re.sub(r'timestamp|keyword|tweets|geocode|modi|merkel|erdogan|tayyip','',text)
        text = re.sub('\[.*?\]', '', text)
        text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
        text = re.sub('\w*\d\w*', '', text)
        text = re.sub('[‘’“”…]', '', text)
        text = re.sub('@', '', text)
        text = re.sub(',', '', text)
        text = re.sub(':', '', text)
#     translator = google_translator()
#     text=translator.translate(text,lang_tgt='en')
    
    
    return text_list

In [21]:
df['Word_Tokens']=df['Word_Tokens'].apply(clean_text)

In [22]:
df.sample(4)

Unnamed: 0,Tweets,Timestamp,Location,Keyword,Category,Country,Word_Tokens
31286,@mehmettt6006 Faz3 ara sonucu açıklanacak Türk...,1/21/2021 7:41,Ankara,Pfizer-Biontech,Vaccine,Turkey,"[faz3, ara, sonucu, açıklanacak, türkiye, kolu..."
29240,@AroonTripathi @APS588 Suspect vaccine,1/20/2021 17:27,Chennai,vaccine,Vaccine,India,"[suspect, vaccine]"
8732,Will The Effects of COVID-19 Continue to Influ...,31-12-2020 13:57,Hamburg,covid,Public confidence in government's handling of ...,Germany,"[effects, covid-19, continue, influence, consu..."
19198,Emine Erdoğan'ın sahiplendiği leblebinin bende...,29-12-2020 11:37,Antalya,Erdoğan,Recep Tayyip Erdo?an,Turkey,"[emine, erdoğan'ın, sahiplendiği, leblebinin, ..."


In [23]:
df['Word_Tokens']=df['Word_Tokens'].apply(lambda x: " ".join(x))

In [24]:
df.sample(4)

Unnamed: 0,Tweets,Timestamp,Location,Keyword,Category,Country,Word_Tokens
24029,Anger as Mexico's Covid-19 czar makes beach tr...,5/1/2021 0:40,Istanbul,covid,Public confidence in government's handling of ...,Turkey,anger mexico's covid-19 czar makes beach trip ...
30947,Greece will not require tourists to provide ce...,1/15/2021 13:01,Istanbul,vaccine,Vaccine,Turkey,greece require tourists provide certification ...
17755,@BhavikaKapoor5 @AamAadmiParty @Kisanaktamorch...,12/1/2021 14:22,Mumbai,BJP,Government,India,spreading fake news dna congress. thugs have…
1737,@UlrichSchneider @Paritaet @DKSB_Bund @Tafel_D...,5/1/2021 18:41,Berlin,Regierung,Government,Germany,


In [26]:
#create separate dataframes for turkey, india and germany

df_germany=df[df['Country']=='Germany']
df_turkey=df[df['Country']=='Turkey']
df_india=df[df['Country']=='India']

In [27]:
df_turkey.sample(2)

Unnamed: 0,Tweets,Timestamp,Location,Keyword,Category,Country,Word_Tokens
21053,AKP'ye yakın bir gazetenin Aylin Sözer cinayet...,29-12-2020 15:07,Antalya,AKP,Government,Turkey,akp'ye yakın bir gazetenin aylin sözer cinayet...
18608,Bu ulke #ERDOĞANLAŞAHLANIŞ yasadi bu bir gerce...,3/1/2021 17:46,Istanbul,Erdoğan,Recep Tayyip Erdo?an,Turkey,ulke #erdoğanlaşahlaniş yasadi bir gercek su d...


In [28]:
#add clolumns for sentiment and score for turkish, german and indian dataframe using the functions and models defined above
df_turkey['sentiment']=df_turkey['Word_Tokens'].apply(sentiment_tr)
df_turkey['score']=df_turkey['Word_Tokens'].apply(sentiment_score_tr)

df_germany['sentiment']=df_germany['Word_Tokens'].apply(sentiment_gr)
df_germany['score']=df_germany['Word_Tokens'].apply(sentiment_score_gr)

df_india['sentiment']=df_india['Word_Tokens'].apply(sentiment_en)
df_india['score']=df_india['Word_Tokens'].apply(sentiment_score_en)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_turkey['sentiment']=df_turkey['Word_Tokens'].apply(sentiment_tr)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_turkey['score']=df_turkey['Word_Tokens'].apply(sentiment_score_tr)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_germany['sentiment']=df_germany['Word_Tokens'].apply(sentiment_g

In [154]:
df_india.sample(3)

Unnamed: 0,Tweets,Timestamp,Location,Keyword,country,Word_Tokens,sentiment,score
15446,Amidst these hard Corona time. There come a gr...,02-01-2021 09:02,Kolkata,corona,India,amidst hard time. come great relief pm.he decl...,POSITIVE,0.99054
14123,@NITINchirp @kmimicryartist @AnOpenLetter001 j...,02-01-2021 15:27,Mumbai,GDP,India,jab waha ka idiot president mask pehan se mana...,NEGATIVE,0.947328
12534,"Always supporting the economy, this company di...",04-01-2021 14:17,Delhi,economy,India,"supporting economy, company farmers reliance f...",POSITIVE,0.996716


In [159]:

df_india['Category']=df_india['Keyword'].apply(categories)
df_turkey['Category']=df_turkey['Keyword'].apply(categories)
df_germany['Category']=df_germany['Keyword'].apply(categories)

# df_india.to_csv("Indien2021.csv", encoding="utf-8")
# df_turkey.to_csv("turkiye2021.csv", encoding="utf-8")
# df_germany.to_csv("germania2021.csv", encoding="utf-8")



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_india['Category']=df_india['Keyword'].apply(categories)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_turkey['Category']=df_turkey['Keyword'].apply(categories)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_germany['Category']=df_germany['Keyword'].apply(categories)


In [173]:
#remove columns with na values

df_turkey = df_turkey[df_turkey["Location"].notna()]


In [175]:
df_turkey.sample(3)

Unnamed: 0,Tweets,Timestamp,Location,Keyword,country,Word_Tokens,sentiment,score,Category
21367,@__stds__ @cibukadam @avukat_osman Canım arkad...,30-12-2020 07:43,Istanbul,hükümet,Turkey,"canım arkadaşım, özel sektör üst düzey yönetic...",positive,0.716583,hükümet
18585,#ERDOĞANLAŞAHLANIŞ #Türkiye \nReis Akdeniz’e v...,03-01-2021 20:21,Istanbul,Erdoğan,Turkey,#erdoğanlaşahlaniş #türkiye reis akdeniz’e bat...,positive,0.81297,erdoğan
23649,6 Ocak 2021 Çarşamba günü Saat 12.00-12.45 Atü...,04-01-2021 09:51,Izmir,corona,Turkey,ocak çarşamba günü saat 12.00-12.45 atürk tv &...,positive,0.876716,Govt Handling Of Pandemic


In [193]:

df_india['Category']=df_india['Category'].apply(categories)
df_turkey['Category']=df_turkey['Category'].apply(categories)
df_germany['Category']=df_germany['Category'].apply(categories)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_india['Category']=df_india['Category'].apply(categories)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_turkey['Category']=df_turkey['Category'].apply(categories)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_germany['Category']=df_germany['Category'].apply(categories)


In [32]:
frames=[df_germany,df_turkey,df_india]

df=pd.concat(frames)

In [33]:
df.sample(5)

Unnamed: 0,Tweets,Timestamp,Location,Keyword,Category,Country,Word_Tokens,sentiment,score
24896,"Yav ben artık dayanamıyorum artık, bu hekimler...",3/1/2021 11:31,Ankara,covid,Public confidence in government's handling of ...,Turkey,"yav dayanamıyorum artık, hekimlerin sağlık çal...",negative,0.946675
22217,2021'de trafik cezaları el yakacak https://t.c...,1/1/2021 9:35,Istanbul,ekonomi,Economy,Turkey,2021'de trafik cezaları el yakacak,positive,0.512002
13947,@INCIndia Aap hote to complete lockdown mei bh...,31-12-2020 11:19,Delhi,GDP,Economy,India,aap hote complete lockdown mei bhi 100% growth...,POSITIVE,0.614318
8435,@NYGovCuomo SO in EASTERN EUROPE the ANTI FASC...,3/1/2021 22:23,Hamburg,covid,Public confidence in government's handling of ...,Germany,eastern europe anti faschists politically mono...,negative,0.970511
3694,"@MarleneJWeiss Ne. Und ich finde, es ist jetzt...",5/1/2021 10:48,Munich,Wirtschaft,Economy,Germany,"ne. finde, echt zeit, (berufstätigen) eltern i...",negative,0.975906


In [37]:
df['Sentiment Score']=df['sentiment'].apply(lambda x: 1 if x.lower()=="positive" else -1)

In [38]:
df.sample(3)

Unnamed: 0,Tweets,Timestamp,Location,Keyword,Category,Country,Word_Tokens,sentiment,score,Sentiment Score
24041,Tacikistan'da 8 aydan bu yana ilk defa Covid-1...,4/1/2021 23:00,Istanbul,covid,Public confidence in government's handling of ...,Turkey,tacikistan'da aydan yana covid-19 vakası sapta...,negative,0.939047,-1
21024,@arz_che @HergelePostasi Bak kardeşim. Gerçek...,30-12-2020 00:32,Antalya,AKP,Government,Turkey,bak kardeşim. gerçekten ciddi bir sorunumuz va...,negative,0.992583,-1
8652,@MrJonasDanner Kannte bis zum 30.12. auch niem...,1/1/2021 15:01,Hamburg,covid,Public confidence in government's handling of ...,Germany,kannte 30.12. niemanden. cousine gestorben. al...,negative,0.935748,-1


In [39]:
df['Sentiment Score']=df['Sentiment Score']*df['score']

In [57]:
df['Positive']=df['sentiment'].apply(lambda x: 1 if x.lower()=='positive'else 0)

df['Positive']=df['Positive']*df['score']

df['Negative']=df['sentiment'].apply(lambda x: 1 if x.lower()=='negative'else 0)

df['Negative']=df['Negative']*df['score']

df['sentiment']=df['sentiment'].apply(lambda x: x.upper())


In [58]:
df.sample(4)

Unnamed: 0,Tweets,Timestamp,Location,Keyword,Category,Country,Word_Tokens,sentiment,score,Sentiment Score,Positive,Negative
9404,"Was gut für Europa ist, ist gut für uns: Angel...",6/1/2021 20:09,Munich,Merkel,Angela Merkel,Germany,"europa ist, uns: angela betreibt ausverkauf de...",NEUTRAL,0.999931,-0.999931,0.0,0.0
2951,https://t.co/H0gHwucwt1\n\nDa gehen sie hin un...,31-12-2020 10:25,Hamburg,Regierung,Government,Germany,"gelder, wohnung sehen kann, entwed…",NEUTRAL,0.977776,-0.977776,0.0,0.0
10236,@me_locket These words reminds me of modi rule...,4/1/2021 10:54,Kolkata,Modi,Narendra Modi,India,words reminds rule india,POSITIVE,0.997276,0.997276,0.997276,0.0
18518,Erdoğan'ın başörtüsü sorunu - Medyascope\nBugü...,4/1/2021 9:26,Istanbul,Erdoğan,Recep Tayyip Erdo?an,Turkey,erdoğan'ın başörtüsü sorunu - medyascope bugün...,NEGATIVE,0.879537,-0.879537,0.0,0.879537


In [59]:
df_germany=df[df['Country']=='Germany']
df_turkey=df[df['Country']=='Turkey']
df_india=df[df['Country']=='India']

We save the dataframes which contain sentiment and scores as csv files

In [60]:
df_germany.to_csv('.\data\germany.csv',encoding='utf-8')

df_india.to_csv('.\data\india.csv',encoding='utf-8')

df_turkey.to_csv('.\data\Turkey.csv', encoding='utf-8')

df.to_csv('.\data\Final.csv', encoding='utf-8')

In [2]:
import pandas as pd

In [8]:
df = pd.read_csv('./data/Final.csv')

In [14]:
df=df.iloc[:,1:]

In [15]:
df_germany=df[df['Country']=='Germany']
df_turkey=df[df['Country']=='Turkey']
df_india=df[df['Country']=='India']

In [16]:
df_india.groupby(['Location'])

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000241C474ED60>

In [3]:
" I love turkey".split()

['I', 'love', 'turkey']

In [None]:
func =lambda x: x fot 