<a href="https://colab.research.google.com/github/oscarvilla/blog_entries_analysis/blob/master/upload_csv.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Getting started
First, we load the libraries we are going to need.

In [1]:
import pandas as pd
import nltk
#nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA
from langdetect import detect
from datetime import datetime
import time



## Importing the dataset

In [2]:
df = pd.read_csv('data/blogtext.csv')

Let's look at the first rows of the imported file.

In [3]:
df[0:6]

Unnamed: 0,id,gender,age,topic,sign,date,text
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,..."
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...
5,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",I had an interesting conversation...


For testing, we take a subsample of the full dataset.
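
The cell that draws the subsample is not shown in this notebook. A minimal sketch of how it could be done with `DataFrame.sample` follows; the toy DataFrame, `SAMPLE_SIZE`, and the seed are illustrative assumptions, not values from the original run.

```python
import pandas as pd

# Toy stand-in for the blog DataFrame (hypothetical rows, not the real data).
df = pd.DataFrame({
    "id": range(1000),
    "text": [f"post number {i}" for i in range(1000)],
})

# Draw a reproducible random subsample without replacement.
# SAMPLE_SIZE and random_state are illustrative choices.
SAMPLE_SIZE = 100
df_sample = df.sample(n=SAMPLE_SIZE, random_state=42).reset_index(drop=True)

print(df_sample.shape)
```

Fixing `random_state` keeps the subsample identical between runs, which makes timing comparisons easier while testing.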

## Getting the sentiment of each text

In `sia` we instantiate the Sentiment Intensity Analyzer. The iterator passes each text to it, and it returns the sentiment-intensity scores for that text:
- pos: proportion of the text that is positive, in [0, 1]
- neu: proportion of the text that is neutral, in [0, 1]
- neg: proportion of the text that is negative, in [0, 1]

These three sum to 1 (up to rounding).

- compound: a single overall score normalized to [-1, 1], where -1 is extremely negative and 1 is extremely positive.
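
To give an intuition for why the compound score stays within [-1, 1], here is a small sketch of the kind of normalization VADER applies to the summed word valences. To my understanding the reference implementation uses `x / sqrt(x² + alpha)` with `alpha = 15`; treat both the formula and the constant as assumptions to verify against the library source. The function name `normalize` is only illustrative here.

```python
import math

def normalize(score, alpha=15.0):
    """Squash an unbounded valence sum into the interval [-1, 1].

    alpha = 15 is assumed to match VADER's default; check the library source.
    """
    norm = score / math.sqrt(score * score + alpha)
    # Clamp to guard against floating-point overshoot.
    return max(-1.0, min(1.0, norm))

# Larger absolute sums approach -1 or 1 but never exceed them.
for raw in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(raw, round(normalize(raw), 4))
```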

Because processing each text takes time, we parallelize the loop to make it faster.



In [6]:
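from concurrent import futures
import time

start = time.time()

def features(text):
    # Score one text with VADER and keep the text next to its scores.
    sia = SIA()
    pol = sia.polarity_scores(text)
    pol['text'] = text
    return pol

# Score the texts in parallel worker processes, 250 texts per task.
# pool.map returns the results in the same order as the input.
with futures.ProcessPoolExecutor() as pool:
    results = list(pool.map(features, df['text'], chunksize=250))

end = time.time()
print(end - start)  # elapsed time in seconds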

2737.3598062992096


In [7]:
pd.DataFrame(results[0:9])

Unnamed: 0,compound,neg,neu,pos,text
0,0.0,0.0,1.0,0.0,"Info has been found (+/- 100 pages,..."
1,0.0,0.0,1.0,0.0,These are the team members: Drewe...
2,-0.8167,0.09,0.814,0.097,In het kader van kernfusie op aarde...
3,0.0,0.0,1.0,0.0,testing!!! testing!!!
4,0.8805,0.0,0.841,0.159,Thanks to Yahoo!'s Toolbar I can ...
5,0.9847,0.04,0.874,0.086,I had an interesting conversation...
6,0.8929,0.078,0.787,0.136,Somehow Coca-Cola has a way of su...
7,0.743,0.073,0.842,0.085,"If anything, Korea is a country o..."
8,-0.8248,0.101,0.804,0.095,Take a read of this news article ...


Here we see the results, which also show that some text entries are not in English (row 2 is in Dutch). I checked the language of each text with `langdetect`; since more than 95% of them are in English, I dropped that step because of its computational cost.
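
One cheaper way to run this kind of check is to estimate the share of English texts from a random sample instead of the whole corpus. The sketch below is only an illustration: `english_share` and `toy_detect` are hypothetical helpers, and the detector is passed in as a function so that `detect` from `langdetect` (already imported above, and returning codes such as `'en'`) could be used in its place.

```python
import random

def english_share(texts, detect_fn, sample_size=1000, seed=0):
    """Estimate the fraction of English texts from a random sample."""
    texts = list(texts)
    rng = random.Random(seed)
    sample = rng.sample(texts, min(sample_size, len(texts)))
    hits = 0
    for text in sample:
        try:
            if detect_fn(text) == "en":
                hits += 1
        except Exception:
            # Detectors can fail on empty or symbol-only strings;
            # such texts are counted as non-English here.
            pass
    return hits / len(sample) if sample else 0.0

# Toy detector for demonstration only: flags a text as Dutch if it
# contains the word "het". A real run would pass langdetect's detect.
def toy_detect(text):
    return "nl" if " het " in f" {text.lower()} " else "en"

texts = [
    "testing!!! testing!!!",
    "In het kader van kernfusie op aarde",
    "I had an interesting conversation",
]
print(english_share(texts, toy_detect))
```

With a real detector, a sample of a few thousand texts would typically be enough to confirm a figure like "more than 95%" at a fraction of the cost.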

Now we add the sentiment column to the original dataset:

In [9]:
results = pd.DataFrame(results)
df = df.join(results, lsuffix='_orig', rsuffix='_res').drop(labels = ['text_res', 'neg', 'neu', 'pos'], axis = 1)
df[0:9]

Unnamed: 0,id,gender,age,topic,sign,date,text_orig,compound
0,2059027,male,15,Student,Leo,"14,May,2004","Info has been found (+/- 100 pages,...",0.0
1,2059027,male,15,Student,Leo,"13,May,2004",These are the team members: Drewe...,0.0
2,2059027,male,15,Student,Leo,"12,May,2004",In het kader van kernfusie op aarde...,-0.8167
3,2059027,male,15,Student,Leo,"12,May,2004",testing!!! testing!!!,0.0
4,3581210,male,33,InvestmentBanking,Aquarius,"11,June,2004",Thanks to Yahoo!'s Toolbar I can ...,0.8805
5,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",I had an interesting conversation...,0.9847
6,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",Somehow Coca-Cola has a way of su...,0.8929
7,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004","If anything, Korea is a country o...",0.743
8,3581210,male,33,InvestmentBanking,Aquarius,"10,June,2004",Take a read of this news article ...,-0.8248


We save the partial result, since even with parallelization the scoring takes close to an hour (about 46 minutes in the run above).

In [12]:
df.to_csv('blogtext_sent.csv')

## Time series
To build the time series, we need to create the date variable, fill in any empty fields it may have, and summarize by gender, age group, topic, or zodiac sign.

In [None]:
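def parse_date(s):
    # Parse strings such as '14,May,2004'; return NaT when parsing fails.
    try:
        return datetime.strptime(s, '%d,%B,%Y')
    except (ValueError, TypeError):
        return pd.NaT

# Parse row by row; the original loop overwrote the whole column on each pass.
df['date2'] = df['date'].apply(parse_date)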

In [None]:
df[0:9]

In [81]:
# '%B' matches full month names such as 'June'; '%b' only matches abbreviations.
datetime.strptime(df2['date'][2], '%d,%B,%Y')

datetime.datetime(2004, 5, 12, 0, 0)
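
The summarization step described at the start of this section is not shown in the notebook. A minimal sketch of one way to do it, assuming monthly averages of `compound` per gender, is given below; the toy rows are made up, and rows with unparseable dates are simply dropped rather than filled.

```python
import pandas as pd

# Toy rows mimicking the notebook's columns (values are made up).
df = pd.DataFrame({
    "gender": ["male", "male", "female", "female", "male"],
    "date": ["14,May,2004", "12,May,2004", "11,June,2004",
             "10,June,2004", "bad,date"],
    "compound": [0.0, -0.8, 0.9, 0.7, 0.5],
})

# Unparseable dates become NaT instead of raising an error.
df["date2"] = pd.to_datetime(df["date"], format="%d,%B,%Y", errors="coerce")

# Drop rows without a valid date, then average the sentiment
# per calendar month and gender (one column per gender).
monthly = (
    df.dropna(subset=["date2"])
      .assign(month=lambda d: d["date2"].dt.to_period("M"))
      .groupby(["month", "gender"])["compound"]
      .mean()
      .unstack("gender")
)
print(monthly)
```

Swapping `"gender"` for `"topic"` or `"sign"` would give the other groupings; an age-group column would first have to be derived from `age`.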