#Midterm Assessment No. 2
##Andrianov A.A.
##IU5-23M


###Task
Solve a text classification problem on any dataset of your choice (other than the example covered in the lecture). The classification may be binary or multiclass. The target feature of the chosen dataset may have any meaning; sentiment analysis is one example.

Build two variants of feature vectorization: one based on CountVectorizer and one based on TfidfVectorizer.

Use the two classifiers assigned to your group's variant:
1.   Classifier No. 1 - LinearSVC
2.   Classifier No. 2 - Multinomial Naive Bayes (MNB)

Evaluate classification quality for each method. Conclude which vectorization variant, paired with which classifier, gave the best quality.
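The two vectorization options named in the task differ only in how they weight tokens. The sketch below illustrates this on a hypothetical three-sentence corpus (not taken from the dataset used later): CountVectorizer stores raw counts, while TfidfVectorizer with default settings rescales them by inverse document frequency and L2-normalizes each row.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, used only for illustration
toy_docs = [
    "stay home stay safe",
    "stock up on food",
    "food prices stay high",
]

count_vec = CountVectorizer()
count_matrix = count_vec.fit_transform(toy_docs)

tfidf_vec = TfidfVectorizer()
tfidf_matrix = tfidf_vec.fit_transform(toy_docs)

# Both vectorizers tokenize the same way by default, so their columns coincide
stay_col = count_vec.vocabulary_["stay"]

# Raw count of "stay" in the first document
stay_count = int(count_matrix[0, stay_col])
# TF-IDF weight of the same token in the same document
stay_tfidf = float(tfidf_matrix[0, stay_col])

print(sorted(count_vec.vocabulary_))
print(stay_count, round(stay_tfidf, 3))
```

The count is an integer that grows with document length, whereas the TF-IDF weight stays within [0, 1] because of the row normalization.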

In [2]:
# Only the imports used below are kept.
# load_boston was removed in scikit-learn 1.2, so importing it fails on current versions.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB

In [3]:
# Load the data
# A set of tweets from the pandemic period, labelled with sentiment
df = pd.read_csv("data.csv", encoding='latin-1')
# Keep rows 1..4999 (note that the slice skips row 0)
df = df[1:5000]
df

Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment
1,3800,48752,UK,16-03-2020,advice Talk to your neighbours family to excha...,Positive
2,3801,48753,Vagabonds,16-03-2020,Coronavirus Australia: Woolworths to give elde...,Positive
3,3802,48754,,16-03-2020,My food stock is not the only one which is emp...,Positive
4,3803,48755,,16-03-2020,"Me, ready to go at supermarket during the #COV...",Extremely Negative
5,3804,48756,"ÃT: 36.319708,-82.363649",16-03-2020,As news of the regionÂs first confirmed COVID...,Positive
...,...,...,...,...,...,...
4995,8794,53746,"Nagpur, India",18-03-2020,Why is Government not transmitting benefits of...,Positive
4996,8795,53747,London,18-03-2020,"""As long as we're not seeing markets I would c...",Extremely Positive
4997,8796,53748,"Victoria, London",18-03-2020,Will school fees be refunded if the #coronavir...,Neutral
4998,8797,53749,United States,18-03-2020,"#USD continues its dominance, markets rebounde...",Negative
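The stray `Â` characters in the tweet text above are typical of multi-byte text decoded as Latin-1. The sketch below reproduces the effect under the assumption that the file actually stores the character U+0092 as UTF-8 bytes; this is not verified against `data.csv`, and the sample string is hypothetical.

```python
# U+0092 is a C1 control character that often stands in for a
# Windows-1252 right single quotation mark in mislabelled text.
original = "region\u0092s"

# Assumption: the source file stores this character as UTF-8 bytes
raw_bytes = original.encode("utf-8")

# Decoding as Latin-1 turns the two-byte sequence into two separate characters
as_latin1 = raw_bytes.decode("latin-1")
# Decoding as UTF-8 recovers the original single character
as_utf8 = raw_bytes.decode("utf-8")

print(repr(as_latin1))
print(repr(as_utf8))
```

If the assumption holds, reading the file with `encoding='utf-8'` would avoid the extra character; if the file contains bytes that are not valid UTF-8, that call would raise an error instead.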


In [4]:
# Drop the columns that are not needed
df = df.drop(columns=['UserName', 'ScreenName', 'TweetAt', 'Location'])
df.head()

Unnamed: 0,OriginalTweet,Sentiment
1,advice Talk to your neighbours family to excha...,Positive
2,Coronavirus Australia: Woolworths to give elde...,Positive
3,My food stock is not the only one which is emp...,Positive
4,"Me, ready to go at supermarket during the #COV...",Extremely Negative
5,As news of the regionÂs first confirmed COVID...,Positive


In [5]:
# Encode the sentiment labels as integer category codes
df['Sentiment'] = df['Sentiment'].astype('category')
df['Sentiment'] = df['Sentiment'].cat.codes
df.head()

Unnamed: 0,OriginalTweet,Sentiment
1,advice Talk to your neighbours family to excha...,4
2,Coronavirus Australia: Woolworths to give elde...,4
3,My food stock is not the only one which is emp...,4
4,"Me, ready to go at supermarket during the #COV...",0
5,As news of the regionÂs first confirmed COVID...,4
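The integer codes follow the sorted order of the category labels, which is why `Positive` becomes 4 and `Extremely Negative` becomes 0 in the output above. A minimal sketch of recovering the mapping, using a hypothetical miniature column that assumes the five labels visible earlier are the only ones present:

```python
import pandas as pd

# Hypothetical miniature of the Sentiment column (assumed label set)
labels = pd.Series(
    ["Positive", "Extremely Negative", "Neutral", "Negative", "Extremely Positive"]
).astype("category")

# Categories are sorted lexicographically; each code is a position in that order
code_to_label = dict(enumerate(labels.cat.categories))
codes = labels.cat.codes.tolist()

print(code_to_label)
print(codes)
```

On the real data the same mapping can be read from `df['Sentiment'].cat.categories`, provided it is saved before the column is overwritten with the codes.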


In [6]:
df.shape

(4999, 2)

In [7]:
# Build a shared vocabulary from the whole corpus (training and test parts together)
vocab_list = df['OriginalTweet'].tolist()
vocab_list[1:10]

['Coronavirus Australia: Woolworths to give elderly, disabled dedicated shopping hours amid COVID-19 outbreak https://t.co/bInCA9Vp8P',
 "My food stock is not the only one which is empty...\r\r\n\r\r\nPLEASE, don't panic, THERE WILL BE ENOUGH FOOD FOR EVERYONE if you do not take more than you need. \r\r\nStay calm, stay safe.\r\r\n\r\r\n#COVID19france #COVID_19 #COVID19 #coronavirus #confinement #Confinementotal #ConfinementGeneral https://t.co/zrlG0Z520j",
 "Me, ready to go at supermarket during the #COVID19 outbreak.\r\r\n\r\r\nNot because I'm paranoid, but because my food stock is litteraly empty. The #coronavirus is a serious thing, but please, don't panic. It causes shortage...\r\r\n\r\r\n#CoronavirusFrance #restezchezvous #StayAtHome #confinement https://t.co/usmuaLq72n",
 'As news of the regionÂ\x92s first confirmed COVID-19 case came out of Sullivan County last week, people flocked to area stores to purchase cleaning supplies, hand sanitizer, food, toilet paper and other goods,
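Building the vocabulary from the whole corpus means that tokens from held-out folds are known during training. An alternative, sketched below on hypothetical toy data, is to leave out the `vocabulary=` argument so that the vectorizer inside the pipeline is refit on the training part of every fold. This is an illustration of the idea, not the setup used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data: three texts per class
texts = ["good food", "great shop", "good prices",
         "bad panic", "awful queue", "bad shortage"]
labels = [1, 1, 1, 0, 0, 0]

# Without a fixed vocabulary, the vectorizer learns its tokens
# from the training folds only; unseen test tokens are ignored
pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])
scores = cross_val_score(pipeline, texts, labels, scoring="accuracy", cv=3)
print(scores.mean())
```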

In [8]:
vocabVect = CountVectorizer()
vocabVect.fit(vocab_list)
corpusVocab = vocabVect.vocabulary_
print('Number of features generated - {}'.format(len(corpusVocab)))

Number of features generated - 17445


In [9]:
for i in list(corpusVocab):
    print('{}={}'.format(i, corpusVocab[i]))

Streaming output truncated to the last 5000 lines.
9aajzcfxva=840
y3kgucr34l=17166
coronavirusoutbreak2020=3958
yv9jealpvx=17299
bbcdorset=2071
disparity=4890
proven=12233
vector=16307
hsuuldmxl4=7627
base_security=2025
specialise=14363
vacant=16245
properties=12192
ql7r8uoalx=12403
khost=8750
flours=6262
pointed=11824
hampered=7187
snjzqephdz=14209
praised=11959
reusable=13072
sacks=13383
transmit=15658
governors=6927
itis=8321
noposguau=10694
quatantineandchill=12455
7ua23qapl4=743
vwbevghfwu=16514
stawinski=14538
ame=1366
11p=106
beach=2114
presents=12018
ambulance=1365
crew=4207
frontlineheroes=6522
rks5qdl1vc=13175
obnormal=10883
rrofovisvt=13285
pittsburghpg=11707
mega=9863
statements=14527
cornavirusupdate=3882
crisiscommunications=4219
warby=16599
parker=11398
tbjlx8irh7=15120
cust=4333
serv=13777
host=7564
sol=14256
grocerydelivery=7035
ann=1455
hui=7647
kathryn=8669
baum=2056
symbolic=14998
anxieties=1499
indicative=8003
larger=8996
0pnxcgvwvi=57
omebfncvwn=1102
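The `token=index` pairs above appear in dictionary insertion order, which makes the listing hard to scan. The following sketch shows, on a hypothetical two-sentence corpus, that `vocabulary_` maps each token to a column index and that sorting by index gives the column order of the feature matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus
toy_docs = ["stay home stay safe", "stock up on food"]
vect = CountVectorizer().fit(toy_docs)

# vocabulary_ maps token -> column index; indices follow the sorted token order
by_index = sorted(vect.vocabulary_.items(), key=lambda item: item[1])
for token, index in by_index[:5]:
    print('{}={}'.format(token, index))
```

Printing only a slice such as `by_index[:5]` also avoids dumping every feature of a large vocabulary.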

In [10]:
tfidfv = TfidfVectorizer(ngram_range=(1,3))
tfidf_ngram_features = tfidfv.fit_transform(vocab_list)
tfidf_ngram_features

<4999x230500 sparse matrix of type '<class 'numpy.float64'>'
	with 452561 stored elements in Compressed Sparse Row format>
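The jump from 17445 unigram features to 230500 columns comes from `ngram_range=(1,3)`, which adds every two-word and three-word sequence as a separate feature. A small sketch of the effect on a hypothetical toy corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus
toy_docs = ["stay home stay safe", "stock up on food"]

unigrams = TfidfVectorizer(ngram_range=(1, 1)).fit(toy_docs)
up_to_trigrams = TfidfVectorizer(ngram_range=(1, 3)).fit(toy_docs)

n_unigram_features = len(unigrams.vocabulary_)
n_ngram_features = len(up_to_trigrams.vocabulary_)

print(n_unigram_features, n_ngram_features)
# Multi-word features contain a space
print(sorted(t for t in up_to_trigrams.vocabulary_ if " " in t))
```

Most n-grams occur in a single document, so the matrix grows much wider while staying very sparse.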

In [11]:
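# Length of row 0 (equals the number of features).
# Slice the row first so that only one row is converted to a dense array.
len(tfidf_ngram_features[0].toarray().ravel())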

230500

In [12]:
# Non-zero values of row 0 (only this row is converted to a dense array)
[i for i in tfidf_ngram_features[0].toarray().ravel() if i>0]


[0.08401773824904957,
 0.10413616574450421,
 0.10413616574450421,
 0.09595605317581173,
 0.10413616574450421,
 0.10413616574450421,
 0.0703581658379043,
 0.10413616574450421,
 0.10413616574450421,
 0.03909376171035052,
 0.08777594060711925,
 0.10413616574450421,
 0.09935110664014131,
 0.10413616574450421,
 0.10413616574450421,
 0.06739230177449762,
 0.10413616574450421,
 0.10413616574450421,
 0.07638662735743722,
 0.10413616574450421,
 0.10413616574450421,
 0.09117099407144882,
 0.10413616574450421,
 0.10413616574450421,
 0.09117099407144882,
 0.10413616574450421,
 0.10413616574450421,
 0.06105325896634548,
 0.09117099407144882,
 0.10413616574450421,
 0.0933226451206084,
 0.10413616574450421,
 0.10413616574450421,
 0.03948152251673741,
 0.10413616574450421,
 0.10413616574450421,
 0.06524065072533805,
 0.10413616574450421,
 0.10413616574450421,
 0.08935179903049259,
 0.10413616574450421,
 0.10413616574450421,
 0.17870359806098518,
 0.10413616574450421,
 0.10413616574450421,
 0.104136165
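The non-zero weights of a row can also be read directly from the sparse representation, together with the tokens they belong to. The sketch below uses a hypothetical toy corpus and relies on the `indices` and `data` attributes of a SciPy CSR row.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus
toy_docs = ["stay home stay safe", "stock up on food"]
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(toy_docs)  # CSR sparse matrix

row = matrix[0]  # still sparse, shape (1, n_features)
index_to_token = {i: t for t, i in vectorizer.vocabulary_.items()}

# row.indices holds the non-zero column indices, row.data the matching weights
nonzero = {index_to_token[i]: float(w) for i, w in zip(row.indices, row.data)}
print(nonzero)
```

Pairing each weight with its token makes repeated values easier to interpret: tokens or n-grams with the same count and document frequency receive the same weight.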

In [13]:
# Evaluate both vectorization methods with both classifiers:
def VectorizeAndClassify(vectorizers_list, classifiers_list):
    for v in vectorizers_list:
        for c in classifiers_list:
            pipeline1 = Pipeline([("vectorizer", v), ("classifier", c)])
            # Mean accuracy over 3-fold cross-validation
            score = cross_val_score(pipeline1, df['OriginalTweet'], df['Sentiment'], scoring='accuracy', cv=3).mean()
            print('Vectorizer - {}'.format(v))
            print('Classifier - {}'.format(c))
            print('Accuracy = {}'.format(score))
            print('===========================')

In [14]:
vectorizers_list = [TfidfVectorizer(vocabulary = corpusVocab), CountVectorizer(vocabulary = corpusVocab)]
classifiers_list = [LinearSVC(), MultinomialNB()]
VectorizeAndClassify(vectorizers_list, classifiers_list)

Vectorizer - TfidfVectorizer(vocabulary={'00': 0, '000': 1, '0000009375': 2, '0000hrs': 3,
                            '008': 4, '00am': 5, '00pdsup4wb': 6, '00pm': 7,
                            '01': 8, '0101': 9, '01625': 10, '0203': 11,
                            '029': 12, '02ddkwsnxo': 13, '04': 14, '0508': 15,
                            '06': 16, '0600': 17, '0618': 18, '0645': 19,
                            '0712128888': 20, '08': 21, '0800': 22, '0808': 23,
                            '086ohsc4ox': 24, '08smp18fiq': 25,
                            '09093052802': 26, '0aaj71zczs': 27,
                            '0acif25540': 28, '0blnzayudb': 29, ...})
Classifier - LinearSVC()
Accuracy = 0.4474893736738847
Vectorizer - TfidfVectorizer(vocabulary={'00': 0, '000': 1, '0000009375': 2, '0000hrs': 3,
                            '008': 4, '00am': 5, '00pdsup4wb': 6, '00pm': 7,
                            '01': 8, '0101': 9, '01625': 10, '0203': 11,
             

##Conclusion
The best quality was achieved by TfidfVectorizer paired with the LinearSVC classifier: accuracy averaged over 3-fold cross-validation was approximately 0.447.
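The accuracy reported above is easier to interpret next to a trivial baseline that always predicts the most frequent class. The sketch below shows the idea on hypothetical stand-in data with a 6:3 class split; `DummyClassifier` ignores the features, so a placeholder matrix is enough. No baseline was measured for the tweet data in this notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical stand-in data: six samples of class 0 and three of class 1
X = np.zeros((9, 1))  # placeholder features, ignored by the baseline
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1])

# Always predicts the majority class seen during training
baseline = DummyClassifier(strategy="most_frequent")
baseline_accuracy = cross_val_score(baseline, X, y, scoring="accuracy", cv=3).mean()
print(baseline_accuracy)
```

For the tweets, the same call with `df['OriginalTweet']` and `df['Sentiment']` in place of `X` and `y` would give the share of the largest sentiment class as a reference point.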