# GMO Discourse Analysis 

This notebook presents an exercise to validate an automated methodology for classifying textual data into different types of discourse in the context of GMOs, based on classical text mining techniques such as _bag of words_, standard _word vectorization_, and _tf-idf_. The following Python libraries will be used:

 - [**scikit-learn**] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Vanderplas, J. (2011). Scikit-learn: Machine learning in Python. Journal of machine learning research, 12(Oct), 2825-2830.
 
 - [**nltk**] Loper, E., & Bird, S. (2002, July). NLTK: The natural language toolkit. In Proceedings of the ACL-02 Workshop on Effective tools and methodologies for teaching natural language processing and computational linguistics-Volume 1 (pp. 63-70). Association for Computational Linguistics.

The theoretical background is mainly based on the following texts:

 - Fontoura, Y. S. D. R. D. (2015). International civil society actors in Genetically Modified Organisms as a field of struggle: a neo-gramscian study in Brazil and the United Kingdom (Doctoral dissertation).
 
 - Levy, D., Reinecke, J., & Manning, S. (2016). The political dynamics of sustainable coffee: Contested value regimes and the transformation of sustainability. Journal of Management Studies, 53, 364-401. doi:10.1111/joms.12144

Author: Lucas Farias

Supervision: Yuna Fontoura and Jefferson Santos


### libraries

In [2]:
import os
import csv
import pandas as pd

import collections
import re

In [25]:
# use nltk for removing stopwords

In [26]:
import nltk
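
A minimal sketch of the stopword-removal step mentioned in the comment above. It uses a small hand-picked set of Portuguese function words (`STOPWORDS_PT`, an assumption made for illustration) so that it runs without downloading any corpus; in practice the list would more likely come from `nltk.corpus.stopwords.words('portuguese')` after running `nltk.download('stopwords')`. The helper name `remove_stopwords` and the sample sentence are hypothetical.

```python
import re

# Illustrative subset only (assumption); NLTK's Portuguese list is much longer.
STOPWORDS_PT = {'de', 'e', 'o', 'a', 'que', 'do', 'da', 'para', 'com',
                'os', 'em', 'no', 'na', 'as', 'um', 'uma'}

def remove_stopwords(text, stopwords=STOPWORDS_PT):
    """Lowercase the text, split it into word tokens and drop stopwords."""
    tokens = re.findall(r'\w+', text.lower())
    return [tok for tok in tokens if tok not in stopwords]

sample = 'A produção de soja no Brasil'
filtered = remove_stopwords(sample)
print(filtered)
```

Lowercasing before filtering means that capitalized variants such as `A` and `a` are treated as the same token.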

### get text data

In [3]:
# current dir
os.getcwd()

'C:\\Users\\LucasFariasLima\\Dropbox\\work'

In [4]:
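txts = os.listdir('docs')
docs = {}

for doc in txts:
    # key each document by its file name without the extension;
    # str.strip('.txt') would also remove leading/trailing 't', 'x' and '.' characters
    with open(os.path.join('docs', doc), 'r', encoding='utf8') as file:
        docs[os.path.splitext(doc)[0]] = file.read()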

### clean text data

In [29]:
def to_unicode(data):

    '''
    transforms text data to unicode (str); bytes are decoded as UTF-8,
    falling back to Latin-1
    '''

    if isinstance(data, bytes):
        try:
            data = data.decode('utf-8')
        except UnicodeDecodeError:
            # iso-8859-1 and latin-1 name the same codec, and it can decode
            # any byte sequence, so a single fallback is enough
            data = data.decode('latin-1')

    return data

In [30]:
import unicodedata

def remove_nonlatin(string):

    '''
    removes non-latin characters and replaces newlines with spaces
    '''

    new_chars = []
    for char in string:
        if char == '\n':
            new_chars.append(' ')
            continue
        # unicodedata.name returns the default ('') for unnamed characters,
        # so control characters are simply skipped
        if unicodedata.name(char, '').startswith(('LATIN', 'SPACE')):
            new_chars.append(char)
    return ''.join(new_chars)
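
A short usage sketch for this cleaning step, written as a compact restatement of the same rule: keep characters whose Unicode name starts with `LATIN` or `SPACE`, and turn newlines into spaces. The name `keep_latin` and the sample string are illustrative assumptions. Note that digits and punctuation are dropped as well, since their Unicode names start with `DIGIT`, `COMMA`, `COLON`, and so on.

```python
import unicodedata

def keep_latin(text):
    """Keep Latin letters and spaces; replace newlines with spaces."""
    kept = []
    for char in text:
        if char == '\n':
            kept.append(' ')
        elif unicodedata.name(char, '').startswith(('LATIN', 'SPACE')):
            kept.append(char)
    return ''.join(kept)

cleaned = keep_latin('Produção de soja:\n120 milhões, 2018')
print(cleaned.split())
```

If numbers such as years or tonnages matter for the discourse classification, the prefix tuple would need to include `'DIGIT'` as well.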

### set signifiers list

#### original terms:

#### bag of words

In [27]:
bagsofwords = [ collections.Counter(re.findall(r'\w+', txt)) for txt in data_corpus]
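
The introduction also lists _tf-idf_ among the techniques. A minimal sketch of how tf-idf weights could be derived from per-document counters like the ones built above, assuming the plain textbook definition tf × log(N / df). This is not necessarily the weighting used later in the notebook: scikit-learn's `TfidfVectorizer`, for instance, applies a smoothed idf and L2 normalization by default. The names `tfidf`, `toy_corpus`, `toy_bags` and `toy_weights` are hypothetical.

```python
import collections
import math
import re

def tfidf(bags):
    """Return one {term: weight} dict per bag, using tf * log(N / df)."""
    n_docs = len(bags)
    # document frequency: number of bags in which each term appears
    df = collections.Counter(term for bag in bags for term in bag)
    weights = []
    for bag in bags:
        total = sum(bag.values())
        weights.append({term: (count / total) * math.log(n_docs / df[term])
                        for term, count in bag.items()})
    return weights

# Toy corpus (assumption), tokenized the same way as the cell above
toy_corpus = ['soja transgênica no Brasil',
              'soja certificada',
              'sementes crioulas no Brasil']
toy_bags = [collections.Counter(re.findall(r'\w+', txt)) for txt in toy_corpus]
toy_weights = tfidf(toy_bags)
```

Terms that occur in fewer documents receive a higher weight: in the second toy document, `certificada` (one document) outweighs `soja` (two documents).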

In [28]:
bagsofwords

[Counter({'Cerca': 1,
          'de': 29,
          '30': 1,
          'mulheres': 9,
          'do': 12,
          'território': 1,
          'da': 25,
          'Borborema': 3,
          'se': 4,
          'encontraram': 1,
          'no': 3,
          'dia': 1,
          '21': 1,
          'agosto': 1,
          'na': 3,
          'casa': 3,
          'guardiã': 2,
          'das': 4,
          'raças': 2,
          'nativas': 1,
          'galinhas': 5,
          'Márcia': 6,
          'Patrícia': 4,
          'Silva': 1,
          'comunidade': 3,
          'Lutador': 1,
          'município': 2,
          'Queimadas': 1,
          'PB': 1,
          'Esse': 2,
          'encontro': 2,
          'teve': 1,
          'por': 1,
          'objetivo': 1,
          'criar': 3,
          'um': 6,
          'ambiente': 1,
          'troca': 2,
          'conhecimentos': 2,
          'sobre': 3,
          'o': 11,
          'manejo': 2,
          'criação': 3,
          'capoeira': 2,
   

In [9]:
i = 5
{key: rank for rank, key in enumerate(sorted(bagsofwords[i], key=bagsofwords[i].get, reverse=True), 1)}

{'de': 1,
 'e': 2,
 'o': 3,
 'a': 4,
 'que': 5,
 'soja': 6,
 'para': 7,
 'do': 8,
 'RTRS': 9,
 'da': 10,
 'com': 11,
 'os': 12,
 'em': 13,
 'no': 14,
 'as': 15,
 'uma': 16,
 'milhões': 17,
 'toneladas': 18,
 'produção': 19,
 'é': 20,
 'boas': 21,
 'trabalho': 22,
 'desmatamento': 23,
 'setor': 24,
 'padrão': 25,
 'dos': 26,
 'como': 27,
 'Brasil': 28,
 'deve': 29,
 'na': 30,
 'safra': 31,
 'um': 32,
 'país': 33,
 'grão': 34,
 'A': 35,
 'padrões': 36,
 'mais': 37,
 'certificação': 38,
 'produtor': 39,
 'práticas': 40,
 'negócios': 41,
 'oferecer': 42,
 'criar': 43,
 'locais': 44,
 'meio': 45,
 'ambiente': 46,
 'quatro': 47,
 'essa': 48,
 'qualquer': 49,
 'nosso': 50,
 'Lidl': 51,
 'créditos': 52,
 'toda': 53,
 'suprimentos': 54,
 'mesmo': 55,
 'qualidade': 56,
 'vegetação': 57,
 'Recentemente': 58,
 'pesquisa': 59,
 'agência': 60,
 'Reuters': 61,
 'divulgou': 62,
 'superar': 63,
 'marca': 64,
 '120': 65,
 '2018': 66,
 '19': 67,
 'aumento': 68,
 '2': 69,
 '8': 70,
 'área': 71,
 'plantada