# GMO Discourse Analysis 

This notebook presents an exercise to validate an automated methodology for classifying textual data into different types of discourse about GMOs, based on classical text mining techniques such as _bag of words_, standard _word vectorization_, and _tf-idf_. The following Python libraries will be used:

 - [**scikit-learn**] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Vanderplas, J. (2011). Scikit-learn: Machine learning in Python. Journal of machine learning research, 12(Oct), 2825-2830.
 
 - [**nltk**] Loper, E., & Bird, S. (2002, July). NLTK: The natural language toolkit. In Proceedings of the ACL-02 Workshop on Effective tools and methodologies for teaching natural language processing and computational linguistics-Volume 1 (pp. 63-70). Association for Computational Linguistics.
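
As a rough illustration of the _bag of words_ and _tf-idf_ techniques named above, the sketch below computes both with the standard library only. The toy documents and the helper names (`tokenize`, `tf_idf`) are hypothetical and not part of this notebook's corpus or pipeline. The weighting is the textbook `count * log(N / df)`; scikit-learn's `TfidfVectorizer` applies a smoothed and normalized variant by default, so its numbers would differ.

```python
import math
import re
from collections import Counter

# Hypothetical toy documents, not taken from the article corpus
docs = [
    "GMO crops increase yield",
    "GMO crops threaten biodiversity",
    "organic farming protects biodiversity",
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

# Bag of words: one term-count mapping per document
bags = [Counter(tokenize(doc)) for doc in docs]

# Document frequency: number of documents that contain each term
doc_freq = Counter(term for bag in bags for term in bag)

def tf_idf(bag, n_docs):
    """Weight each term count by log(n_docs / df), the textbook idf."""
    return {term: count * math.log(n_docs / doc_freq[term])
            for term, count in bag.items()}

weights = [tf_idf(bag, len(docs)) for bag in bags]
```

Terms that occur in a single document (such as `yield`) receive the largest weight, while terms shared by several documents (such as `gmo`) are down-weighted.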

The theoretical background is mainly based on the following texts:

 - Fontoura, Y. S. D. R. D. (2015). International civil society actors in Genetically Modified Organisms as a field of struggle: a neo-gramscian study in Brazil and the United Kingdom (Doctoral dissertation).
 
 - Levy, D., Reinecke, J., & Manning, S. (2016). The Political Dynamics of Sustainable Coffee: Contested Value Regimes and the Transformation of Sustainability. Journal of Management Studies, 53, 364-401. doi:10.1111/joms.12144.

Author: Lucas Farias

Supervision: Yuna Fontoura and Jefferson Santos


### libraries

In [1]:
import os
import csv
import pandas as pd
import collections
import re
import nltk
import glob
from tqdm import tqdm_notebook
import regex
import unicodedata

### functions to clean text data

In [2]:
def to_unicode(data):
    
    '''
    transforms text data to unicode (str); bytes are decoded as utf-8,
    falling back to latin-1, and any other input is returned unchanged
    ''' 
    
    if not isinstance(data, bytes):
        return data
    
    # utf-8 is tried first because latin-1 (an alias of iso-8859-1) maps
    # every byte to a character and therefore never fails to decode
    for encoding in ('utf-8', 'latin-1'):
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
        
    return data
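
A minimal sketch of why the order of the decoding attempts matters, using a hypothetical helper `decode_with_fallback` and a made-up sample string (neither is part of the notebook). Bytes written as latin-1 are usually invalid utf-8, so the first attempt fails and the fallback is used; the reverse order would never reach utf-8, because latin-1 accepts any byte sequence.

```python
def decode_with_fallback(data, encodings=("utf-8", "latin-1")):
    """Return str unchanged; decode bytes with the first codec that works."""
    if isinstance(data, str):
        return data
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    return data

# Hypothetical sample containing one accented character (U+00EA)
sample = "soja transg\u00eanica"
utf8_bytes = sample.encode("utf-8")
latin1_bytes = sample.encode("latin-1")

from_utf8 = decode_with_fallback(utf8_bytes)
from_latin1 = decode_with_fallback(latin1_bytes)

# Trying latin-1 first "succeeds" on utf-8 bytes but yields mojibake
wrong_order = decode_with_fallback(utf8_bytes, encodings=("latin-1", "utf-8"))
```

Both `from_utf8` and `from_latin1` recover the original string, while `wrong_order` does not.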

In [3]:
def remove_nonlatin(string): 
    
    '''
    keeps only latin letters and spaces and replaces newlines with spaces
    (digits, punctuation and other symbols are dropped as well)
    '''
    
    new_chars = []
    for char in string:
        if char == '\n':
            new_chars.append(' ')
            continue
        # unicodedata.name returns the default '' for unnamed characters
        # (e.g. control characters), so no exception handling is needed
        if unicodedata.name(char, '').startswith(('LATIN', 'SPACE')):
            new_chars.append(char)
    return ''.join(new_chars)
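
The filter relies on Unicode character names, which can be inspected directly. The sketch below is an illustration on a hypothetical sample string, not part of the original pipeline; it shows that accented latin letters survive while digits, punctuation, symbols and CJK characters are removed, since their names do not start with `LATIN` or `SPACE`.

```python
import unicodedata

def char_name(char):
    """Unicode name of a character, or '' when it has none."""
    return unicodedata.name(char, "")

# Hypothetical sample: latin text, a check mark, two CJK characters, digits
sample = "Soja \u00e9 GMO \u2713 \u5927\u8c46 2019!"

names = {c: char_name(c) for c in sample}
kept = "".join(c for c in sample
               if char_name(c).startswith(("LATIN", "SPACE")))
```

Here `names["2"]` is `DIGIT TWO` and `names["!"]` is `EXCLAMATION MARK`, which is why both are filtered out along with the non-latin characters.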

## Analysis of <span style="color:red"> PRO </span> discourse

### get text data

In [4]:
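# directory with the articles, relative to the current working directory;
# the trailing '' makes os.path.join end the path with a separator
articles_path = os.path.join(os.getcwd(), 'articles', 'repsoy', '')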

In [5]:
len(os.listdir(articles_path))

367

In [6]:
all_articles_path = os.path.join(articles_path, "all_articles.txt")
# skip the merged file itself so that re-running the cell does not read it back
read_files = [f for f in glob.glob(os.path.join(articles_path, "*.txt"))
              if os.path.basename(f) != "all_articles.txt"]

# concatenate every article into a single text file
with open(all_articles_path, "w", encoding="utf-8") as outfile:
    for f in tqdm_notebook(read_files):
        with open(f, "r", encoding="utf-8") as infile:
            outfile.write(infile.read())

In [7]:
with open(all_articles_path, 'r', encoding="utf-8") as f:
    text_data = f.read()
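
The merge-then-read pattern can be tried on throwaway files. This is a sketch under stated assumptions: the file names and contents are hypothetical, the files live in a temporary directory, and a newline is written after each article so that the last word of one file does not fuse with the first word of the next (the cells above do not add such a separator).

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical stand-ins for the article files
    for name, body in [("a.txt", "first article"), ("b.txt", "second article")]:
        with open(os.path.join(tmp_dir, name), "w", encoding="utf-8") as f:
            f.write(body)

    merged_path = os.path.join(tmp_dir, "all_articles.txt")
    # Sort for a reproducible order and leave out the merged file itself
    sources = sorted(p for p in glob.glob(os.path.join(tmp_dir, "*.txt"))
                     if os.path.basename(p) != "all_articles.txt")

    with open(merged_path, "w", encoding="utf-8") as outfile:
        for path in sources:
            with open(path, "r", encoding="utf-8") as infile:
                outfile.write(infile.read() + "\n")  # keep articles apart

    with open(merged_path, "r", encoding="utf-8") as f:
        merged_text = f.read()
```

Sorting the paths is a design choice for reproducibility, since `glob.glob` does not guarantee any particular order.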

### preprocess text data

In [8]:
text_data = remove_nonlatin(text_data)