# Pre-Processing on the General Covid-19 Dataset

The dataset contains many entries that are not in English. We have translated them to English (see the `traduttore.py` section below). Here we visualize the distribution of languages in the dataset. The language of each entry is read from its `lang` field, which is assigned by a Twitter machine-learning classifier.

In [8]:
from collections import Counter
import json
import pandas as pd
import altair as alt

data = []
with open('dataset/general_result_translated_full.json', 'r') as f:
    for line in f:
        data.append(json.loads(line))

# Collect the language code of every entry
langs = [element['lang'] for element in data]

count = Counter(langs)
df = pd.DataFrame.from_dict(count, orient='index').reset_index()
df = df.rename(columns={'index':'lang', 0:'count'})

# TODO: the mapping still has to be completed with the flags; check that the flags I set make sense (in the if).
# The domain must follow the structure of the if above.
# The colors need fixing because the light ones are hard to see, but they must still differ from one another
# so that the bars can be told apart.

# TODO: carry the structure of the FAKE if over into the for loop above

domain = ['ar','bg','ca','cs','cy','da','de','el','en','es','et','eu','fa','fi','fr','hi','ht','hu','in','is']  # .......... (remaining language codes are elided in the source)
range_ = ['red','green','blue','pink','purple','black','yellow','brown','darkgrey','orange','fuchsia','teal','coral','cyan','violet','crimson','lime','lightblue','lightgrey','khaki','tan','indigo','olive','gold','maroon','silver','azure','tomato','lightcyan','darkgreen','chocolate','plum','peru']

chart = alt.Chart(df).mark_bar().encode(
    x='count:Q',
    y='lang:N',
    #color=alt.Color('lang',scale=alt.Scale(domain=domain, range=range_),legend=alt.Legend(columns=4, symbolLimit=0))
)

text = chart.mark_text(
    align='left',
    baseline='middle',
    color='black',
    dx=3  # Nudges text to right so it doesn't appear on top of the bar
).encode(
    text='count:Q'
)


(chart + text).properties(height=900)
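
The TODO notes above ask for a complete language-to-color mapping with distinct, readable colors. As a minimal sketch of one way to approach it (not the authors' final mapping), the domain can be derived from the language codes actually present in the data and paired with a palette of darker colors. The helper `build_color_scale`, the sample codes, and the palette below are assumptions made for illustration.

```python
def build_color_scale(langs, palette):
    """Return (domain, range_) lists of equal length for the observed language codes."""
    domain = sorted(set(langs))
    if len(set(palette)) != len(palette):
        raise ValueError("palette contains duplicate colors")
    if len(palette) < len(domain):
        raise ValueError(f"need {len(domain)} colors, palette has {len(palette)}")
    # Take exactly as many colors as there are language codes
    return domain, list(palette[:len(domain)])


# Hypothetical sample of 'lang' values; in the notebook this would be `langs`.
sample_langs = ['en', 'es', 'en', 'fr', 'in', 'es', 'en']
# Darker named colors only, since light ones are hard to see.
dark_palette = ['red', 'green', 'blue', 'purple', 'black', 'brown', 'teal', 'crimson']

domain, range_ = build_color_scale(sample_langs, dark_palette)
print(domain)
print(range_)
```

The two lists have the shape expected by the commented-out `color=` encoding in the chart, which passes them to `alt.Scale(domain=domain, range=range_)`.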


### traduttore.py

The .json file produced contained text in multiple languages, so we wrote this script to translate into English the fields that were not in English ("hashtags" and "full_text").

Everything is done through the Google Translate APIs, accessed via the `google_trans_new` package.

Because Google Translate limits large volumes of requests, the for loop below pre-filters the entries based on the `lang` field of the .json file, so that entries already in English are not sent for translation. The `lang` field is filled in automatically during the hydration process; the language classification is done by machine learning algorithms.
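
The pre-filtering step can be illustrated in isolation. The following is a minimal sketch, assuming records shaped like the hydrated tweets (a `lang` and a `full_text` key); the helper `split_by_language` and the sample records are hypothetical and only show how many requests the filter saves.

```python
def split_by_language(records, target_lang="en"):
    """Partition records into those already in target_lang and those needing translation."""
    already, pending = [], []
    for record in records:
        if record.get("lang") == target_lang:
            already.append(record)
        else:
            pending.append(record)
    return already, pending


# Hypothetical sample records, not taken from the dataset
records = [
    {"lang": "en", "full_text": "Stay safe"},
    {"lang": "es", "full_text": "Quédate en casa"},
    {"lang": "it", "full_text": "Restate a casa"},
    {"lang": "en", "full_text": "Wash your hands"},
]

already, pending = split_by_language(records)
# Each pending record costs at least one translation request
print(len(already), "skipped;", len(pending), "to translate")
```

Counting the pending records before starting gives a rough idea of how long a run will take when every request is followed by a one-second pause.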

The script below is the full version. We divided the execution into two phases: the first worked on the "full_text" field, and the second worked on the "hashtags" field, assuming that the "full_text" field had already been translated.

In [None]:
import json
import sys
import string
from google_trans_new import google_translator  
import time

# Load the hydrated tweets (one JSON object per line)
data = []
with open('dataset/general_result.json', 'r') as f:
    for line in f:
        data.append(json.loads(line))


translator = google_translator()
for index, element in enumerate(data):
    if element['lang'] == "en":
        # Pre-filter: entries already in English are not sent to Google
        print(str(index) + " already English")
    else:
        element['full_text'] = translator.translate(element['full_text'], lang_tgt='en')
        time.sleep(1)  # sleep to avoid being blocked by Google
        #print(str(index) + " index " + element['full_text'])
        for entity in element['entities']['hashtags']:
            # lang_tgt is the target language
            entity['text'] = translator.translate(entity['text'], lang_tgt='en')
            time.sleep(1)  # sleep to avoid being blocked by Google
            #print(str(index) + " index " + entity['text'])


# Note: mode 'a' appends, so rerunning the script adds duplicate records to an existing file
with open('general_result_translated_full.json', 'a') as f_w:
    for line_w in data:
        #print("printing")
        json.dump(line_w, f_w)
        f_w.write('\n')
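
The two-phase execution described above (first "full_text", then "hashtags") can be sketched as two separate passes over the same records. This is only an illustration under stated assumptions: the function names are hypothetical, and `fake_translate` is a stand-in for the real Google Translate call so that the example runs offline.

```python
def translate_full_text(records, translate):
    """Phase 1: translate 'full_text' of every non-English record in place."""
    for record in records:
        if record["lang"] != "en":
            record["full_text"] = translate(record["full_text"])


def translate_hashtags(records, translate):
    """Phase 2: translate hashtag texts, leaving 'full_text' untouched."""
    for record in records:
        if record["lang"] != "en":
            for entity in record["entities"]["hashtags"]:
                entity["text"] = translate(entity["text"])


def fake_translate(text):
    """Stand-in translator (an assumption); the real script calls Google Translate."""
    return f"[en] {text}"


# Hypothetical sample records, not taken from the dataset
records = [
    {"lang": "en", "full_text": "Stay home",
     "entities": {"hashtags": [{"text": "covid"}]}},
    {"lang": "it", "full_text": "Restate a casa",
     "entities": {"hashtags": [{"text": "iorestoacasa"}]}},
]

translate_full_text(records, fake_translate)  # first run
translate_hashtags(records, fake_translate)   # second run, on the output of the first
print(records[1]["full_text"])
print(records[1]["entities"]["hashtags"][0]["text"])
```

Because the `lang` field is never modified, the second pass can still use it to skip English records even though their neighbours' "full_text" has already been translated.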