<a href="https://colab.research.google.com/github/oscarvilla/blog_entries_analysis/blob/master/upload_csv.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Getting started
First we load the libraries we will need. The CSV import follows the method described in
https://towardsdatascience.com/3-ways-to-load-csv-files-into-colab-7c14fcbdcb92

In [None]:
import pandas as pd
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [None]:
!pip install langdetect
from langdetect import detect



## Importing directly from Drive

In [None]:
# Code to read csv file into Colaboratory:
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

Once we have the shareable link for the file we want to import, we paste it here (it only changes if we switch to a different file).

In [None]:
link = 'https://drive.google.com/open?id=1GjWbMO139LNna2bZbUShmWhPZ06J8hKc' # The shareable link


Since we only need the id portion of the link, we split it off.

In [None]:
fluff, id = link.split('=')
print(id)  # Verify that you have everything after '='

1GjWbMO139LNna2bZbUShmWhPZ06J8hKc
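
Splitting on '=' works for a link of the form shown above, but the two-name unpacking raises a ValueError as soon as the URL carries a second query parameter. A slightly more defensive sketch, using only the standard library, parses the query string instead. The helper name `drive_file_id` is hypothetical, and the approach assumes the link keeps the `id` query parameter as in the example.

```python
from urllib.parse import urlparse, parse_qs

def drive_file_id(link):
    """Return the value of the 'id' query parameter of a Drive share link."""
    query = parse_qs(urlparse(link).query)
    if 'id' not in query:
        raise ValueError("no 'id' parameter found in link: " + link)
    return query['id'][0]

link = 'https://drive.google.com/open?id=1GjWbMO139LNna2bZbUShmWhPZ06J8hKc'
file_id = drive_file_id(link)
print(file_id)
```

Using a name such as `file_id` also avoids shadowing Python's built-in `id` function.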


Now we import the CSV as a pandas DataFrame.

In [None]:
# Point PyDrive at the shared file, download it locally, then load it
downloaded = drive.CreateFile({'id': id})
downloaded.GetContentFile('blogtext.csv')
df = pd.read_csv('blogtext.csv')

Let's look at the first rows of the imported file.

In [None]:
df[0:6]

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...
5,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",I had an interesting conversation...


In 'sia' we instantiate the Sentiment Intensity Analyzer. The loop passes each text to it, and it returns an analysis of the sentiment intensity in that text:
- pos: proportion of the text that is positive, in [0, 1]
- neu: proportion of the text that is neutral, in [0, 1]
- neg: proportion of the text that is negative, in [0, 1]
- compound: a single normalized score summarizing the overall sentiment, in [-1, 1], where -1 is extremely negative and 1 is extremely positive.

The three proportions pos, neu, and neg sum to 1.

In addition, since some texts are not in English, we create a variable holding the language of each text, using langdetect.
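
Returning to the compound score described above: to give a feel for why it stays within [-1, 1], here is a minimal sketch of the normalization step as we understand it from the VADER reference implementation, where an unbounded sum of word valences x is squashed with x / sqrt(x² + α) and α = 15. This is an illustration only, not the library's code; the function name `normalize_valence` and the sample sums are made up for the example.

```python
import math

def normalize_valence(total, alpha=15):
    """Squash an unbounded valence sum into the interval [-1, 1]."""
    score = total / math.sqrt(total * total + alpha)
    # Clamp to guard against floating-point overshoot
    return max(-1.0, min(1.0, score))

# Hypothetical valence sums, from strongly negative to strongly positive
for total in (-10, -1, 0, 1, 10):
    print(total, round(normalize_valence(total), 4))
```

The curve is steep near zero and flattens toward the extremes, which is why long, emotionally loaded entries tend to end up close to -1 or 1.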



In [None]:
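import time

start = time.time()
sia = SIA()
results = []

for n in df['text']:
    # VADER scores for this entry: neg, neu, pos, compound
    pol = sia.polarity_scores(n)
    pol['text'] = n

    try:
        pol['lan'] = detect(n)
    except Exception:
        # langdetect raises when the text has no usable features (e.g. it is empty);
        # catching Exception rather than a bare except keeps Ctrl-C working
        pol['lan'] = "NA"

    results.append(pol)

end = time.time()

print(end - start)  # elapsed time in seconds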

8556.464194774628
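
The raw number of seconds is hard to read at a glance. A small sketch using the standard library turns it into hours, minutes, and seconds; the value is the one printed above, and the variable names are ours.

```python
from datetime import timedelta

elapsed_seconds = 8556.464194774628  # value printed by the cell above

# Drop the fractional part so the string stays short
readable = str(timedelta(seconds=int(elapsed_seconds)))
print(readable)  # 2:22:36
```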


Before looking at the results, let's save them, since they took almost two and a half hours (about 8,556 seconds) to compute.

In [None]:
#pd.DataFrame(results).to_hdf('results.h5', key='df', mode='w')

In [None]:
#from google.colab import files
#files.download('results.h5')
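
If HDF5 is not an option (`to_hdf` needs the optional PyTables dependency), a plain CSV checkpoint is one alternative. The sketch below uses only the standard library and a tiny made-up `sample_results` list standing in for the real `results`; the file name `results_checkpoint.csv` is also an assumption.

```python
import csv
import os
import tempfile

# Stand-in for the list of dicts built in the loop (values are made up)
sample_results = [
    {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0,
     'text': 'testing!!! testing!!!', 'lan': 'en'},
    {'neg': 0.1, 'neu': 0.7, 'pos': 0.2, 'compound': 0.5,
     'text': 'a "quoted", comma-laden text', 'lan': 'en'},
]

path = os.path.join(tempfile.gettempdir(), 'results_checkpoint.csv')

# Write one row per analysed text; the csv module handles quoting
with open(path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=list(sample_results[0].keys()))
    writer.writeheader()
    writer.writerows(sample_results)

# Read the checkpoint back to confirm it round-trips
with open(path, newline='', encoding='utf-8') as f:
    restored = list(csv.DictReader(f))

print(len(restored), restored[1]['text'])
```

Note that `csv.DictReader` returns every value as a string, so the scores need to be converted back with `float` after reloading.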

In [None]:
results = pd.DataFrame(results)           
results[0:10]

Unnamed: 0,compound,lan,neg,neu,pos,text
0,0.0,en,0.0,1.0,0.0,"Info has been found (+/- 100 pages,..."
1,0.0,nl,0.0,1.0,0.0,These are the team members: Drewe...
2,-0.8167,en,0.09,0.814,0.097,In het kader van kernfusie op aarde...
3,0.0,en,0.0,1.0,0.0,testing!!! testing!!!
4,0.8805,en,0.0,0.841,0.159,Thanks to Yahoo!'s Toolbar I can ...
5,0.9847,en,0.04,0.874,0.086,I had an interesting conversation...
6,0.8929,en,0.078,0.787,0.136,Somehow Coca-Cola has a way of su...
7,0.743,en,0.073,0.842,0.085,"If anything, Korea is a country o..."
8,-0.8248,en,0.101,0.804,0.095,Take a read of this news article ...
9,-0.5588,en,0.099,0.827,0.073,I surf the English news sites a l...


Let's look at the proportion of entries that are in English.

In [None]:
len(results.loc[results['lan'] == 'en']) / len(results)

0.9562561281345224
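
The same kind of proportion can be computed for every detected language at once. A minimal sketch with `collections.Counter`, run on a made-up list of language codes rather than the real `results['lan']` column:

```python
from collections import Counter

# Hypothetical language codes; "NA" marks texts where detection failed
sample_langs = ['en', 'en', 'nl', 'en', 'NA', 'en', 'en', 'es', 'en', 'en']

counts = Counter(sample_langs)
total = len(sample_langs)

# Share of each language, most frequent first
shares = {lang: count / total for lang, count in counts.most_common()}
print(shares)
```

Seeing the "NA" share alongside the others also shows how often language detection failed outright.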

About 95.6% of the entries are in English, so we will not retain the language variable; keeping it costs more than it is worth for our purposes.

Let's join the results with the original dataset, keeping only those of the new variables that we are going to use.

In [None]:
df = df.join(results, lsuffix='_orig', rsuffix='_res').drop(labels = ['text_res', 'neg', 'neu', 'pos', 'lan'], axis = 1)
df[0:19]

Unnamed: 0,id,gender,age,topic,sign,date,text_orig,compound
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,...",0.0
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...,0.0
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...,-0.8167
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!,0.0
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...,0.8805
5,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",I had an interesting conversation...,0.9847
6,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",Somehow Coca-Cola has a way of su...,0.8929
7,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004","If anything, Korea is a country o...",0.743
8,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",Take a read of this news article ...,-0.8248
9,3581210,male,33,InvestmentBanking,Aquarius,"09,June,2004",I surf the English news sites a l...,-0.5588


Now that we have them together, we save the new dataset.