# <b>Assignment 1 - Exercise List - 27/08/2018</b>
<br><br>
**Attention**
<ul>
    <li>The datasets used in this assignment are available on FibOnline and on the network, in the "Basseto/Farina" folder
    <li>When the assignment is finished, submit it on FibOnline in Jupyter Notebook format (.ipynb)
    <li>The submission deadline is 03/09/2018 at 19:00.
</ul>

## <b>Activity 1</b>
Develop a predictive model based on the **KNN** classifier that, given the characteristics of a wheat seed, classifies it as one of the varieties Kama, Rosa, or Canadian. Use the **seeds** dataset for your model.<br>
Compute the <b>accuracy</b> of your model while varying the number of seed features used (1 to 7) and the number of neighbors considered (1 to 13).

**About the dataset:**

The "seeds" dataset was obtained from the UCI repository (https://archive.ics.uci.edu/ml/datasets/seeds) in plain-text format (.txt). It contains 210 instances in three classes of 70 instances each, corresponding respectively to the wheat varieties Kama, Rosa, and Canadian. The list below shows the attributes of each instance, which describe the wheat kernel: seven attributes of type "real" plus an enumerated field holding the class (wheat variety).
<br><br>
Attributes of each instance:
1. Area
2. Perimeter
3. Compactness
4. Kernel length
5. Kernel width
6. Asymmetry coefficient
7. Kernel groove length
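
Since the exercise varies the number of features from 1 to 7, it helps to see what "the first n features" means for a single seed. The sketch below is only illustrative: `FEATURE_NAMES` and `first_n_features` are hypothetical names, and the sample values are the point that this notebook classifies in cell `In [2]`.

```python
# Hypothetical English names for the seven attributes, in dataset column order.
FEATURE_NAMES = [
    "area", "perimeter", "compactness", "kernel_length",
    "kernel_width", "asymmetry_coefficient", "groove_length",
]

# The point classified in cell In [2] of this notebook.
sample = [20.24, 16.91, 0.8897, 6.315, 3.962, 5.901, 6.188]

def first_n_features(point, n):
    """Keep only the first n attributes of a seed (a plain list slice)."""
    return point[:n]

subset = first_n_features(sample, 3)
named = dict(zip(FEATURE_NAMES, subset))
print(named)
```

Because the features are always taken as a prefix, the experiment only evaluates 7 of the possible feature subsets, in the column order of the file.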

In [1]:
from collections import Counter
from collections import defaultdict
from scipy.spatial import distance
import matplotlib.pyplot as plt
import re

import random
import numpy as np
import math
data = np.genfromtxt('H:\\datasets\\seeds_dataset.txt', delimiter='', usecols=(0,1,2,3,4,5,6,7))


seeds = [   ([c0,c1,c2,c3,c4,c5,c6],label) for c0, c1, c2, c3, c4, c5, c6, label in data]

def majority_vote(labels):
    """assumes that labels are ordered from nearest to farthest"""
    vote_counts = Counter(labels)
    winner, winner_count = vote_counts.most_common(1)[0]
    num_winners = len([count for count in vote_counts.values() if count == winner_count])
    if num_winners == 1:
        return winner # unique winner, so return it
    else:
        return majority_vote(labels[:-1]) # try again without the farthest

def knn_classify(k, labeled_points, new_point):
    """each labeled point should be a pair (point, label)"""
    # order the labeled points from nearest to farthest
    by_distance = sorted(labeled_points, key=lambda point: distance.euclidean(point[0], new_point))
    # find the labels for the k closest
    k_nearest_labels = [label for _, label in by_distance[:k]]
    # and let them vote
    return majority_vote(k_nearest_labels)
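
The tie-breaking rule in `majority_vote` is easier to follow on a tiny example. The sketch below restates the two functions in a self-contained form (using a hand-written `euclidean` helper instead of `scipy`, which is an assumption made only to keep the example dependency-free) and runs them on four hypothetical one-dimensional points.

```python
from collections import Counter
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def majority_vote(labels):
    """labels are ordered nearest to farthest; on a tie, drop the farthest and retry."""
    vote_counts = Counter(labels)
    winner, winner_count = vote_counts.most_common(1)[0]
    num_winners = sum(1 for c in vote_counts.values() if c == winner_count)
    if num_winners == 1:
        return winner
    return majority_vote(labels[:-1])

def knn_classify(k, labeled_points, new_point):
    by_distance = sorted(labeled_points,
                         key=lambda pair: euclidean(pair[0], new_point))
    return majority_vote([label for _, label in by_distance[:k]])

# Hypothetical toy data: two "a" points on the outside, two "b" points inside.
toy = [([0.0], "a"), ([1.0], "b"), ([2.0], "b"), ([3.0], "a")]

# Seen from [0.4], the labels by distance are a, b, b, a.
result_k2 = knn_classify(2, toy, [0.4])  # a, b ties -> falls back to [a]
result_k4 = knn_classify(4, toy, [0.4])  # a, b, b, a ties -> falls back to [a, b, b]
print(result_k2, result_k4)
```

With `k=2` the vote ties and the nearest neighbor decides; with `k=4` the tie is resolved by discarding the farthest point, so the result changes with `k`.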


In [2]:
knn_classify(7, seeds, [20.24, 16.91, 0.8897, 6.315, 3.962, 5.901, 6.188])

for k in range(1,14):
    
    for attr in range(1,8):
        num_correct = 0
        
        for seed in seeds:
            # leave-one-out: classify each seed using all the other seeds
            features, actual_label = seed
            other_seeds = [(carac[0:attr], label) for carac, label in seeds if (carac, label) != seed]
            
            predicted_label = knn_classify(k, other_seeds, features[0:attr])
            if predicted_label == actual_label:
                num_correct += 1
        print (k, "neighbor[s]:", attr, "atributes[s]:", num_correct, "correct out of", len(seeds))

1 neighbor[s]: 1 atributes[s]: 175 correct out of 210
1 neighbor[s]: 2 atributes[s]: 172 correct out of 210
1 neighbor[s]: 3 atributes[s]: 172 correct out of 210
1 neighbor[s]: 4 atributes[s]: 175 correct out of 210
1 neighbor[s]: 5 atributes[s]: 175 correct out of 210
1 neighbor[s]: 6 atributes[s]: 189 correct out of 210
1 neighbor[s]: 7 atributes[s]: 190 correct out of 210
2 neighbor[s]: 1 atributes[s]: 175 correct out of 210
2 neighbor[s]: 2 atributes[s]: 172 correct out of 210
2 neighbor[s]: 3 atributes[s]: 172 correct out of 210
2 neighbor[s]: 4 atributes[s]: 175 correct out of 210
2 neighbor[s]: 5 atributes[s]: 175 correct out of 210
2 neighbor[s]: 6 atributes[s]: 189 correct out of 210
2 neighbor[s]: 7 atributes[s]: 190 correct out of 210
3 neighbor[s]: 1 atributes[s]: 179 correct out of 210
3 neighbor[s]: 2 atributes[s]: 177 correct out of 210
3 neighbor[s]: 3 atributes[s]: 177 correct out of 210
3 neighbor[s]: 4 atributes[s]: 171 correct out of 210
3 neighbor[s]: 5 atributes[s
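
The loop above reports counts of correct predictions, while the exercise asks for accuracy. A minimal sketch of the conversion, using three of the counts printed above (the `accuracy` helper and the `reported` dictionary are hypothetical names introduced here):

```python
def accuracy(num_correct, total):
    """Fraction of seeds classified correctly."""
    return num_correct / total

# (k, number of attributes) -> correct predictions, copied from the output above.
reported = {(1, 1): 175, (1, 6): 189, (1, 7): 190}

accuracies = {key: round(accuracy(correct, 210), 4)
              for key, correct in reported.items()}
for (k, n_attr), acc in accuracies.items():
    print(k, "neighbor[s]:", n_attr, "attribute[s]: accuracy", acc)
```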

<br><br>

## <b>Activity 2</b>
Develop a predictive model based on the **Naive Bayes** classifier that, given the lyrics of a song, classifies it as one of the genres **Bossa Nova, Funk, Sertanejo, or Gospel**. The dataset will be provided; it consists of song lyrics grouped by genre.<br>
Display the <b>Precision, Recall, and F1-Score</b> values.

In [16]:
def tokenize(message):
    message = message.lower() # convert to lowercase
    all_words = re.findall("[a-z0-9']+", message) # extract the words
    return set(all_words) # remove duplicates

def count_words(training_set):
    """training set consists of pairs (lyrics, genre index); genres are 0=bossa, 1=funk, 2=sertanejo, 3=gospel"""
    counts = defaultdict(lambda: [0, 0, 0, 0])
    for lyrics, genero in training_set:
        for word in tokenize(lyrics):
            counts[word][genero] += 1
    return counts
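
One thing worth checking about `tokenize` is how it treats Portuguese lyrics: the pattern `[a-z0-9']+` only matches unaccented letters, so accented words are split into fragments. The sketch below compares it with a Unicode-aware pattern. The `tokenize_unicode` variant is an assumption offered for comparison, not what this notebook used to produce its results.

```python
import re

def tokenize_ascii(message):
    """Same pattern as the notebook's tokenize: ASCII letters, digits, apostrophes."""
    return set(re.findall("[a-z0-9']+", message.lower()))

def tokenize_unicode(message):
    """Hypothetical alternative: \\w also matches accented letters in Python 3 strings."""
    return set(re.findall(r"[\w']+", message.lower()))

text = "Não conte"
ascii_tokens = tokenize_ascii(text)      # "não" is split around the "ã"
unicode_tokens = tokenize_unicode(text)  # "não" stays whole
print(sorted(ascii_tokens))
print(sorted(unicode_tokens))
```

Switching patterns would change the vocabulary and therefore the metrics reported later, so the numbers below only apply to the ASCII tokenizer.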

In [23]:
def word_probabilities(counts, total_bossa, total_funk, total_sertanejo, total_gospel, k=0.5):
    """turn the word counts into a list of tuples (w, p(w | bossa), p(w | funk), p(w | sertanejo), p(w | gospel))"""
    return [(w,
            (bossa + k) / (total_bossa + 2 * k),
            (funk + k) / (total_funk + 2 * k),
            (sertanejo + k) / (total_sertanejo + 2 * k),
            (gospel + k) / (total_gospel + 2 * k))
            for w, (bossa, funk, sertanejo, gospel) in counts.items()]
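
The smoothing term in `word_probabilities` is easier to see with numbers. Because `tokenize` returns a set, each word is either present or absent in a lyric, which is why the denominator adds `2 * k` (one pseudo-count for each of the two outcomes). The sketch below uses a hypothetical helper, `smoothed_probability`, and made-up counts.

```python
def smoothed_probability(count, total, k=0.5):
    """p(word present | genre) with k pseudo-counts for 'present' and k for 'absent'."""
    return (count + k) / (total + 2 * k)

# Hypothetical genre with 100 training lyrics.
never_seen = smoothed_probability(0, 100)     # word in none of the lyrics
always_seen = smoothed_probability(100, 100)  # word in every lyric
print(never_seen, always_seen)
```

Neither value reaches exactly 0 or 1, so both `math.log(p)` and `math.log(1.0 - p)` stay finite during classification.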

In [18]:
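def spam_probability(word_probs, message):
    """return [p(bossa), p(funk), p(sertanejo), p(gospel)] for the lyrics in message, assuming equal priors"""
    message_words = tokenize(message)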
    log_prob_if_bossa = log_prob_if_funk = log_prob_if_sertanejo = log_prob_if_gospel = 0.0
    
    # iterate through each word in our vocabulary
    for word, prob_if_bossa, prob_if_funk, prob_if_sertanejo, prob_if_gospel in word_probs:
        # if *word* appears in the message,
        # add the log probability of seeing it
        if word in message_words:
            log_prob_if_bossa += math.log(prob_if_bossa)
            log_prob_if_funk += math.log(prob_if_funk)
            log_prob_if_sertanejo += math.log(prob_if_sertanejo)
            log_prob_if_gospel += math.log(prob_if_gospel)
        # if *word* doesn't appear in the message
        # add the log probability of _not_ seeing it
        # which is log(1 - probability of seeing it)
        else:
            log_prob_if_bossa += math.log(1.0 - prob_if_bossa)
            log_prob_if_funk += math.log(1.0 - prob_if_funk)
            log_prob_if_sertanejo += math.log(1.0 - prob_if_sertanejo)
            log_prob_if_gospel += math.log(1.0 - prob_if_gospel)
    prob_if_bossa = math.exp(log_prob_if_bossa)
    prob_if_funk = math.exp(log_prob_if_funk)
    prob_if_sertanejo = math.exp(log_prob_if_sertanejo)
    prob_if_gospel = math.exp(log_prob_if_gospel)
    prob_total = prob_if_bossa + prob_if_funk + prob_if_sertanejo + prob_if_gospel
    if prob_total==0: return [0,0,0,0]
    resultado = [prob_if_bossa / prob_total, prob_if_funk / prob_total, prob_if_sertanejo / prob_total, prob_if_gospel / prob_total]
    return resultado
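
The `prob_total==0` branch above is reached when all four `math.exp` calls underflow to zero, which can happen with a large vocabulary. One common alternative, shown here as a sketch and not as what this notebook does, is to subtract the largest log-probability before exponentiating (the log-sum-exp trick). The function name and the log values are hypothetical.

```python
import math

def normalize_log_probs(log_probs):
    """Turn per-genre log-likelihoods into probabilities without underflow."""
    largest = max(log_probs)
    # After the shift the largest term is exp(0) = 1, so the sum is never zero.
    weights = [math.exp(lp - largest) for lp in log_probs]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical log-likelihoods for bossa, funk, sertanejo, gospel.
log_probs = [-1200.0, -1205.0, -1210.0, -1198.0]

naive_total = sum(math.exp(lp) for lp in log_probs)  # every term underflows to 0.0
posteriors = normalize_log_probs(log_probs)
print(naive_total, posteriors)
```

The ranking of genres is unchanged by the shift; only the normalization becomes numerically safe.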

In [24]:
import glob, re
# modify the path with wherever you've put the files
path = r"H:\datasets\lyrics\*"
data = []
# glob.glob returns every filename that matches the wildcarded path
for fn in glob.glob(path):
    lyric = ''
    # genre index taken from the file name: 0=bossa, 1=funk, 2=sertanejo, 3=gospel, -1=unknown
    genre = 0 if "bossa" in fn else 1 if "funk" in fn else 2 if "sertanejo" in fn else 3 if "gospel" in fn else -1
    
    with open(fn,'r', encoding="utf8", errors='ignore') as file:
        for line in file:
            if line.startswith("lyric"):
                continue
            elif line.startswith('"'):
                if not lyric:
                    continue
                else:
                    data.append((lyric, genre))
                    lyric = ''
            else:
                lyric += line.replace('"', '').replace("'", "").rstrip().lstrip() + ' '
                

                

class NaiveBayesClassifier:
    def __init__(self, k=0.5):
        self.k = k
        self.word_probs = []
    def train(self, training_set):
        # count the training lyrics of each genre
        num_bossa = len([is_bossa
                        for lyric, is_bossa in training_set
                        if is_bossa == 0])
        num_funk = len([is_funk
                        for lyric, is_funk in training_set
                        if is_funk == 1])
        num_sertanejo = len([is_sertanejo
                        for lyric, is_sertanejo in training_set
                        if is_sertanejo == 2])
        num_gospel = len([is_gospel
                        for lyric, is_gospel in training_set
                        if is_gospel == 3])
        
        
        # run training data through our "pipeline"
        word_counts = count_words(training_set)
        self.word_probs = word_probabilities(word_counts,
                                            num_bossa,
                                            num_funk,
                                            num_sertanejo,
                                            num_gospel,
                                            self.k)
    def classify(self, message):
        return spam_probability(self.word_probs, message)             

In [25]:
import random
from collections import defaultdict
import math
def split_data(data, prob):
    """split the data into fractions [prob, 1 - prob]"""
    results = [], []
    for row in data:
        results[0 if random.random() < prob else 1].append(row)
    return results

random.seed(0)
train_data, test_data = split_data(data, 0.75)

In [26]:
classifier = NaiveBayesClassifier()
classifier.train(train_data)

In [66]:
classificacao = classifier.classify("Não conte os teus maiores sonhos a ninguém Não mostre a sua ferida para quem não tem Remédio pra curá-la e forças para te erguer Não não não não")
print(classificacao)


[0.0007036755008242429, 2.419451923912453e-07, 0.00018227937122204449, 0.9991138031827613]


In [67]:
v = list(range(4))

print(v)

[0, 1, 2, 3]


In [27]:
from collections import Counter

# one (is_actual_genre, is_predicted_genre) pair per test lyric and per genre (one-vs-rest)
counts = Counter([ (genero == gen, classifier.classify(lyric)[gen] > 0.5 )
          for lyric, genero in test_data for gen in range(4)])



In [28]:

print (counts)

Counter({(False, False): 2345, (True, True): 680, (True, False): 146, (False, True): 133})


In [29]:

def precision(tp, fp, fn, tn):
    return tp / (tp + fp)

def recall(tp, fp, fn, tn):
    return tp / (tp + fn)

def f1_score(tp, fp, fn, tn):
    p = precision(tp, fp, fn, tn)
    r = recall(tp, fp, fn, tn)
    return 2 * p * r / (p + r)

# keys of counts are (actual, predicted): tp=(True, True), fp=(False, True), fn=(True, False), tn=(False, False)

print (precision(680,133,146,2345))
print (recall(680,133,146,2345))
print (f1_score(680,133,146,2345))

0.8364083640836408
0.8232445520581114
0.8297742525930446


In [None]:
dict_prob = dict(Counter((genre, 0 if probs[0] > 0.5 else 1 if probs[1] > 0.5 else 2 if probs[2] > 0.5 else 3 if probs[3] > 0.5 else 5)
                        for _, genre, probs in classified))
cm = [[0 for x in range(4)] for y in range(4)]
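
The cell above starts a 4x4 confusion matrix but is left unfinished. A minimal sketch of how such a matrix could be filled and used for per-genre metrics follows. It assumes the predicted genre is the one with the highest probability (rather than the `> 0.5` threshold used earlier), and every name and data value in it is hypothetical.

```python
GENRES = ["bossa", "funk", "sertanejo", "gospel"]

def predicted_genre(probs):
    """Index of the most probable genre (argmax)."""
    return max(range(len(probs)), key=lambda i: probs[i])

def confusion_matrix(pairs, num_classes=4):
    """cm[actual][predicted] counts the (actual, predicted) pairs."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for actual, predicted in pairs:
        cm[actual][predicted] += 1
    return cm

def per_class_metrics(cm, c):
    """Precision, recall and F1 for class c, treating it as one-vs-rest."""
    tp = cm[c][c]
    fp = sum(row[c] for row in cm) - tp   # predicted c, actually another genre
    fn = sum(cm[c]) - tp                  # actually c, predicted another genre
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Hypothetical (actual genre, predicted probabilities) pairs.
classified = [
    (0, [0.70, 0.10, 0.10, 0.10]),
    (0, [0.20, 0.60, 0.10, 0.10]),
    (1, [0.10, 0.80, 0.05, 0.05]),
    (2, [0.10, 0.10, 0.70, 0.10]),
    (3, [0.05, 0.05, 0.10, 0.80]),
    (3, [0.10, 0.10, 0.60, 0.20]),
]
pairs = [(actual, predicted_genre(probs)) for actual, probs in classified]
cm = confusion_matrix(pairs)
for c, name in enumerate(GENRES):
    print(name, per_class_metrics(cm, c))
```

Per-genre values complement the pooled precision, recall, and F1 computed earlier, since a single weak genre can be hidden by the pooled counts.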