# Task 1. Data collection and implementation of data-cleaning scripts

Task 1 is split into two parts:
* Task 1.1: data collection;
* Task 1.2: data cleaning.

## Task 1.2. Data cleaning (continued)

This part continues removing irrelevant information from the documents:
* stop-word removal;
* selection of document features.

After processing, the text of each tweet must be at least **50** characters long.

The collected corpus must contain at least **100 thousand** unique documents:
* before cleaning: 175,585 documents;
* after the initial cleaning: 100,233 documents;
* after stop-word removal: 79,995 documents.
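
The three counts above can be turned into retention percentages with a few lines of arithmetic. The sketch below only restates the reported numbers; the helper name `share` is an assumption made for this illustration.

```python
# Document counts reported in the list above.
n_raw = 175_585          # before cleaning
n_first_pass = 100_233   # after the initial cleaning
n_no_stop = 79_995       # after stop-word removal


def share(part, whole):
    """Return part / whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


first_pass_share = share(n_first_pass, n_raw)    # kept by the initial cleaning
no_stop_share = share(n_no_stop, n_first_pass)   # kept by the stop-word step
overall_share = share(n_no_stop, n_raw)          # kept overall
print(first_pass_share, no_stop_share, overall_share)
```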

The results are:
* a tool for removing stop words from texts;
* the document collection after stop-word removal.
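
As a rough illustration of what such a stop-word removal tool does, here is a minimal sketch. It assumes whitespace tokenization and uses a tiny, made-up stop-word set (`DEMO_STOP_WORDS`) and a made-up sentence; the notebook itself relies on the NLTK English list, and the function name `remove_stop_words` is hypothetical.

```python
# Hypothetical, tiny stop-word set used only for this illustration;
# the notebook uses NLTK's English stop-word list instead.
DEMO_STOP_WORDS = {"this", "is", "a", "of", "the"}


def remove_stop_words(text, stop_words):
    """Drop whitespace-separated tokens that appear in stop_words."""
    return " ".join(token for token in text.split() if token not in stop_words)


# Made-up input sentence, already lowercased and free of punctuation.
sample = "this is a short example of the cleaning step"
cleaned = remove_stop_words(sample, DEMO_STOP_WORDS)
print(cleaned)
```

The membership test is case-sensitive, so this approach presumes the text was lowercased during the earlier cleaning stage.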

In [115]:
import pandas as pd
import nltk
from nltk.corpus import stopwords

### Loading the document collection after the initial cleaning

In [116]:
PATH = "C:/Users/LADA/Inf_Search"
data_path = "{}/{}".format(PATH, "data")

In [117]:
first_cleared_data_file = 'cleared_data.csv'

In [118]:
df = pd.read_csv("{}/{}".format(data_path, first_cleared_data_file))

In [119]:
df.columns

Index(['Unnamed: 0', 'id', 'conversation_id', 'created_at', 'date', 'time',
       'timezone', 'user_id', 'username', 'name', 'place', 'tweet', 'mentions',
       'urls', 'photos', 'replies_count', 'retweets_count', 'likes_count',
       'hashtags', 'cashtags', 'link', 'retweet', 'quote_url', 'video', 'near',
       'geo', 'source', 'user_rt_id', 'user_rt', 'retweet_id', 'reply_to',
       'retweet_date', 'translate', 'trans_src', 'trans_dest'],
      dtype='object')

In [120]:
# Rename the column that holds the original document number
df.rename(columns={'Unnamed: 0': 'initial_num'}, inplace=True)

In [121]:
df.columns

Index(['initial_num', 'id', 'conversation_id', 'created_at', 'date', 'time',
       'timezone', 'user_id', 'username', 'name', 'place', 'tweet', 'mentions',
       'urls', 'photos', 'replies_count', 'retweets_count', 'likes_count',
       'hashtags', 'cashtags', 'link', 'retweet', 'quote_url', 'video', 'near',
       'geo', 'source', 'user_rt_id', 'user_rt', 'retweet_id', 'reply_to',
       'retweet_date', 'translate', 'trans_src', 'trans_dest'],
      dtype='object')

### Counting the documents in the collection before stop-word removal

In [122]:
def get_total_number_of_docs(dataframe):
    return dataframe.shape[0]

def get_total_number_of_docs_by_users(dataframe):
    return dataframe.groupby('name').size()

In [123]:
get_total_number_of_docs(df)

100233

In [124]:
get_total_number_of_docs_by_users(df)

name
Ariana Grande           3312
Barack Obama            6230
Britney Spears          2924
CNN Breaking News      11840
Cristiano Ronaldo       1894
Donald J. Trump         9568
Ellen DeGeneres         7912
Justin Bieber           3560
Justin Timberlake       2005
KATY PERRY              6098
Kim Kardashian West     4465
Lady Gaga               5320
Narendra Modi          10481
Rihanna                 4445
Selena Gomez            2137
Shakira                 4283
Taylor Swift             279
Team Demi               6341
Twitter                 1687
YouTube                 5452
dtype: int64

### Removing stop words from the tweet text

In [125]:
nltk.download(['punkt','stopwords'])
stop_words = set(stopwords.words("english"))

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\LADA\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\LADA\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [126]:
print(stop_words)

{'more', 'here', 'will', 'which', 'ma', "wouldn't", 'or', 'to', 'when', 'on', 'up', 'couldn', 'shan', 'y', 'by', "you'd", 'too', 'doing', 'how', 'out', 'themselves', 'him', 'ours', "shouldn't", 'having', 'with', 'both', 'but', 'some', 'didn', 'no', 'do', "weren't", 'off', 'we', 'own', 'mustn', "isn't", 'in', 'just', 'herself', "you're", 'what', 'such', "hadn't", 'down', "mustn't", "hasn't", "shan't", 'where', 'whom', 'himself', 'and', 'between', 'isn', "you'll", 'who', 'that', 'hasn', 'an', 'wasn', 'hadn', "wasn't", 'were', 'weren', 'once', 'me', 'of', 'been', 'into', 'not', 'haven', "haven't", 'have', "needn't", 'because', 'yourself', 'those', 'its', 'the', 'i', 'from', 'same', 'aren', 'they', 'than', 'needn', 'wouldn', 'nor', 'them', "that'll", 'being', 'had', 'further', 'yourselves', 'you', 'for', 'his', 'o', "couldn't", 'these', "mightn't", 'then', "won't", 'after', 'through', 'should', 'll', 'myself', 'itself', "doesn't", 'a', 'my', 'over', 'ourselves', 'if', 'mightn', 'while', 'f

In [127]:
len(stop_words)

179

In [128]:
df['clean_tweet'] = df['tweet'].apply(lambda x: ' '.join([item for item in x.split() if item not in stop_words]))

In [129]:
df[['tweet', 'clean_tweet']]

Unnamed: 0,tweet,clean_tweet
0,i believe that where you live shouldn t affect...,believe live affect access life saving treatme...
1,congratulations to my cousin isamebarak for he...,congratulations cousin isamebarak new album fu...
2,felicidades a mi prima hermana isamebarak por ...,felicidades mi prima hermana isamebarak por su...
3,just heard that the waka video has reached bil...,heard waka video reached billion views incredi...
4,who s excited about shak s special guest appea...,excited shak special guest appearance tonight ...
...,...,...
100228,time to vote grab a friend to join you and hea...,time vote grab friend join head polling place
100229,at the final rally of his final campaign last ...,final rally final campaign last night presiden...
100230,it s election day this is your last chance to ...,election day last chance help win thing rt lin...
100231,election day is here confirm your polling plac...,election day confirm polling place bring frien...


### Selecting columns with non-null values

In [130]:
df.info(show_counts=True)  # null_counts was deprecated in favour of show_counts

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100233 entries, 0 to 100232
Data columns (total 36 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   initial_num      100233 non-null  int64  
 1   id               100233 non-null  int64  
 2   conversation_id  100233 non-null  int64  
 3   created_at       100233 non-null  int64  
 4   date             100233 non-null  object 
 5   time             100233 non-null  object 
 6   timezone         100233 non-null  object 
 7   user_id          100233 non-null  int64  
 8   username         100233 non-null  object 
 9   name             100233 non-null  object 
 10  place            33 non-null      object 
 11  tweet            100233 non-null  object 
 12  mentions         100233 non-null  object 
 13  urls             100233 non-null  object 
 14  photos           100233 non-null  object 
 15  replies_count    100233 non-null  int64  
 16  retweets_count   100233 non-null  int6

As the table shows, non-null values predominate only in the columns of some features; a subset of those columns is used in the following steps.
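
One way to back this kind of choice with numbers is to compute the share of non-null values per column. The sketch below uses a small invented frame and a hypothetical threshold `MIN_SHARE`; the notebook itself picks the columns by hand, so this is only an illustration of the idea.

```python
import pandas as pd

# Toy frame standing in for the tweet collection; all values are invented.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "tweet": ["a", "b", "c", "d"],
    "place": [None, None, None, "Paris"],
    "near": [None, None, None, None],
})

# Share of non-null values per column, between 0 and 1.
non_null_share = toy.notna().mean()

# Hypothetical threshold: keep columns that are at least 90% populated.
MIN_SHARE = 0.9
kept_columns = non_null_share[non_null_share >= MIN_SHARE].index.tolist()
print(kept_columns)
```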

In [133]:
df_non_null = df[[
    'initial_num', 'id', 'conversation_id', 'created_at', 'date', 'time',
    'timezone', 'user_id', 'username', 'name', 'clean_tweet', 'mentions',
    'urls', 'photos', 'replies_count', 'retweets_count', 'likes_count',
    'hashtags', 'link', 'quote_url', 'video', 'reply_to',
]]

In [134]:
df_non_null.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100233 entries, 0 to 100232
Data columns (total 22 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   initial_num      100233 non-null  int64 
 1   id               100233 non-null  int64 
 2   conversation_id  100233 non-null  int64 
 3   created_at       100233 non-null  int64 
 4   date             100233 non-null  object
 5   time             100233 non-null  object
 6   timezone         100233 non-null  object
 7   user_id          100233 non-null  int64 
 8   username         100233 non-null  object
 9   name             100233 non-null  object
 10  clean_tweet      100233 non-null  object
 11  mentions         100233 non-null  object
 12  urls             100233 non-null  object
 13  photos           100233 non-null  object
 14  replies_count    100233 non-null  int64 
 15  retweets_count   100233 non-null  int64 
 16  likes_count      100233 non-null  int64 
 17  hashtags  

### Removing tweets shorter than 50 characters

In [135]:
# Keep tweets whose cleaned text has at least 50 characters, not counting whitespace
new_df = df_non_null[df_non_null['clean_tweet'].apply(lambda text: sum(len(token) for token in text.split()) >= 50)]
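
The filter above sums token lengths, so spaces do not count toward the 50-character limit. The following sketch, on an invented three-row frame, contrasts that rule with a plain string-length check; the variable names are assumptions made for the example.

```python
import pandas as pd

# Invented texts chosen to sit on either side of the 50-character boundary.
toy = pd.DataFrame({"clean_tweet": [
    "short text",
    "a" * 25 + " " + "b" * 25,   # 50 letters plus one space
    "a" * 24 + " " + "b" * 25,   # 49 letters, 50 characters with the space
]})

# Length without whitespace, equivalent to summing the token lengths.
non_space_len = toy["clean_tweet"].str.replace(r"\s+", "", regex=True).str.len()
# Plain string length, which also counts spaces.
full_len = toy["clean_tweet"].str.len()

kept_strict = toy[non_space_len >= 50]
kept_loose = toy[full_len >= 50]
print(len(kept_strict), len(kept_loose))
```

The third row passes only the looser check, which shows that the choice of counting rule changes how many documents survive.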

In [137]:
new_df[['name', 'date', 'clean_tweet', 'hashtags', 'replies_count', 'retweets_count', 'likes_count']][:10]

Unnamed: 0,name,date,clean_tweet,hashtags,replies_count,retweets_count,likes_count
0,Shakira,2020-05-28,believe live affect access life saving treatme...,['#globalgoalunite'],124,487,2834
1,Shakira,2020-05-27,congratulations cousin isamebarak new album fu...,[],30,115,1021
2,Shakira,2020-05-27,felicidades mi prima hermana isamebarak por su...,[],117,408,3387
3,Shakira,2020-05-25,heard waka video reached billion views incredi...,[],788,3912,34161
4,Shakira,2020-05-20,excited shak special guest appearance tonight ...,['#thevoice'],139,453,3817
5,Shakira,2020-05-15,felicidades de nuestro equipo fpiesdescalzos e...,['#díadelmaestro'],27,217,2180
6,Shakira,2020-05-15,los maestros son los pilares de la sociedad co...,[],87,642,5492
7,Shakira,2020-05-14,mustered strength put makeup day job homeschoo...,['#smallvictories'],646,1326,26612
8,Shakira,2020-05-11,check shak performance try everything zootopia...,"['#disneyfamilysingalong', '#disneyfamilysinga...",162,787,4589
9,Shakira,2020-05-11,check disney family singalong today abcnetwork...,['#disneyfamilysingalong'],179,576,5889


### Writing the tweets without stop words to a CSV file

In [139]:
data_without_stopwords_file = 'data_without_stopwords.csv'

In [140]:
new_df.to_csv("{}/{}".format(data_path, data_without_stopwords_file))
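
By default `to_csv` also writes the DataFrame index, which comes back as an `Unnamed: 0` column when the file is read again. The sketch below shows this on an invented two-row frame, using in-memory buffers instead of files, and compares it with `index=False`; it is an optional variation, not what the notebook does.

```python
import io

import pandas as pd

# Invented frame used only to demonstrate the round trip.
toy = pd.DataFrame({"id": [10, 20], "clean_tweet": ["first text", "second text"]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# With index=False the index is omitted and the columns round-trip unchanged.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```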

### Viewing and counting the processed tweets

In [141]:
df_without_stopwords = pd.read_csv("{}/{}".format(data_path, data_without_stopwords_file))

In [146]:
df_without_stopwords.columns

Index(['Unnamed: 0', 'initial_num', 'id', 'conversation_id', 'created_at',
       'date', 'time', 'timezone', 'user_id', 'username', 'name',
       'clean_tweet', 'mentions', 'urls', 'photos', 'replies_count',
       'retweets_count', 'likes_count', 'hashtags', 'link', 'quote_url',
       'video', 'reply_to'],
      dtype='object')

In [148]:
df_without_stopwords[['id', 'name', 'date', 'time', 'clean_tweet', 'mentions', 'hashtags', 'replies_count', 'retweets_count', 'likes_count']]

Unnamed: 0,id,name,date,time,clean_tweet,mentions,hashtags,replies_count,retweets_count,likes_count
0,1266078954286350337,Shakira,2020-05-28,21:48:39,believe live affect access life saving treatme...,['glblctzn'],['#globalgoalunite'],124,487,2834
1,1265729123302944768,Shakira,2020-05-27,22:38:33,congratulations cousin isamebarak new album fu...,['isamebarak'],[],30,115,1021
2,1265712616804057090,Shakira,2020-05-27,21:32:58,felicidades mi prima hermana isamebarak por su...,['isamebarak'],[],117,408,3387
3,1264925485626208264,Shakira,2020-05-25,17:25:11,heard waka video reached billion views incredi...,[],[],788,3912,34161
4,1262876482751389696,Shakira,2020-05-20,01:43:11,excited shak special guest appearance tonight ...,['nbcthevoice'],['#thevoice'],139,453,3817
...,...,...,...,...,...,...,...,...,...,...
79990,265817228384034816,Barack Obama,2012-11-06,17:05:40,amanda ak supporting president obama college s...,[],[],142,1331,393
79991,265816369537351681,Barack Obama,2012-11-06,17:02:15,check voters reasons supporting president obam...,[],['#voteobama'],168,1077,309
79992,265798182842273792,Barack Obama,2012-11-06,15:49:59,final rally final campaign last night presiden...,[],[],213,1348,492
79993,265785848665100289,Barack Obama,2012-11-06,15:00:59,election day last chance help win thing rt lin...,[],[],456,8550,677


In [149]:
get_total_number_of_docs_by_users(df_without_stopwords)

name
Ariana Grande           2215
Barack Obama            5171
Britney Spears          2187
CNN Breaking News      11514
Cristiano Ronaldo       1362
Donald J. Trump         8957
Ellen DeGeneres         5158
Justin Bieber           2411
Justin Timberlake       1582
KATY PERRY              4750
Kim Kardashian West     3231
Lady Gaga               4294
Narendra Modi           9809
Rihanna                 3345
Selena Gomez            1318
Shakira                 3882
Taylor Swift             214
Team Demi               4008
Twitter                  967
YouTube                 3620
dtype: int64
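
To see how much each account lost, the per-user counts before and after filtering can be placed side by side. This is a minimal sketch with invented counts standing in for the two `groupby('name').size()` results; the column names `removed` and `kept_pct` are assumptions.

```python
import pandas as pd

# Invented per-user counts; in the notebook these would come from
# get_total_number_of_docs_by_users applied before and after filtering.
before = pd.Series({"User A": 200, "User B": 100, "User C": 50}, name="before")
after = pd.Series({"User A": 150, "User B": 90, "User C": 20}, name="after")

# Align the two series on the user name and derive the comparison columns.
comparison = pd.concat([before, after], axis=1)
comparison["removed"] = comparison["before"] - comparison["after"]
comparison["kept_pct"] = (100 * comparison["after"] / comparison["before"]).round(1)
print(comparison.sort_values("kept_pct"))
```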