# Text cleaning 

[Carlos Ortiz](https://www.linkedin.com/in/carlosortizdev/), [Sarah Santiago](https://www.linkedin.com/in/sarah-santiago-7a297b18a/), and [Vivek Datta](https://www.linkedin.com/in/vivek-datta/) wrote most of the code. Jae Yeon Kim reviewed it and made minor modifications. Run this notebook with the `Python3` kernel.

## Import libraries

In [1]:
# Text processing

import re
import string
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

# Data manipulation 
import numpy as np
import warnings
import pandas as pd
from pandas.api.types import CategoricalDtype
from sklearn.preprocessing import StandardScaler

# Data visualization 
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# ML
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import DictVectorizer
from sklearn.decomposition import PCA

warnings.filterwarnings('ignore')

[nltk_data] Downloading package stopwords to /home/jae/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Import data 

In [2]:
# Read in scraped articles from csv file to dataframe
articles = pd.read_csv('/home/jae/ITS-Text-Classification/raw_data/sample_articles.csv')

## Clean data

In [3]:

# Select relevant columns and convert date column into Datetime format

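# .copy() returns an independent data frame, so the date assignment below
# does not operate on a slice of the original
articles = articles[['text', 'source', 'group', 'date', 'intervention', 'expanding', 'distancing', 'assimilating']].copy()

articles['date'] = pd.to_datetime(articles['date'], format='%Y%m%d')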

articles.head()

,text,source,group,date,intervention,expanding,distancing,assimilating
0,"Third Rate Or Not, No Meddling Please! By UPEN...",News India - Times,Indian,1997-10-24,post,0.0,1.0,0.0
1,Phagwah Parade Draws Many By DHARMVIR GEHLAUT ...,News India - Times,Indian,2001-03-23,post,,,
2,"Advani Blames Congress And UF DALTONGUNJ, Biha...",News India - Times,Indian,1998-01-23,post,,,
3,Violence Feared During Assembly Elections By N...,News India - Times,Indian,2001-01-05,post,,,
4,Indrani Rahman's Final Bow By ARUN A. AGUIAR P...,News India - Times,Indian,1999-02-19,post,,,
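
The `format='%Y%m%d'` argument tells pandas to expect dates written as eight digits, such as `19971024`. The sketch below uses made-up values rather than the real corpus. It shows the conversion, and it also shows how `errors='coerce'` (an optional addition, not used in the cell above) turns unparseable entries into `NaT` instead of raising an error.

```python
import pandas as pd

# Toy values for illustration only; the last entry is deliberately malformed
raw_dates = pd.Series(['19971024', '20010323', 'not-a-date'])

# errors='coerce' replaces values that do not match the format with NaT
parsed = pd.to_datetime(raw_dates, format='%Y%m%d', errors='coerce')

# Count how many rows failed to parse
n_failed = parsed.isna().sum()

print(parsed)
print(f'{n_failed} value(s) could not be parsed')
```

Checking the number of `NaT` values after conversion is a quick way to spot malformed dates before they propagate into later steps.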


### Create a response variable 

In [4]:

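# Build a binary response: 1 if the article has a value in 'assimilating', 0 if it is missing
# Note: only 'assimilating' is checked, standing in for all three label columns
# (expanding, distancing, assimilating)

response = []

for is_missing in articles['assimilating'].isnull().values:
    if is_missing:
        response.append(0)
    else:
        response.append(1)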

# Create new binary column called category based on this
# 1 indicates domestic issue, 0 indicates non-domestic issue

articles['category'] = response

articles.head()

,text,source,group,date,intervention,expanding,distancing,assimilating,category
0,"Third Rate Or Not, No Meddling Please! By UPEN...",News India - Times,Indian,1997-10-24,post,0.0,1.0,0.0,1
1,Phagwah Parade Draws Many By DHARMVIR GEHLAUT ...,News India - Times,Indian,2001-03-23,post,,,,0
2,"Advani Blames Congress And UF DALTONGUNJ, Biha...",News India - Times,Indian,1998-01-23,post,,,,0
3,Violence Feared During Assembly Elections By N...,News India - Times,Indian,2001-01-05,post,,,,0
4,Indrani Rahman's Final Bow By ARUN A. AGUIAR P...,News India - Times,Indian,1999-02-19,post,,,,0
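
The loop above can also be written as a single vectorized expression. The following is a minimal sketch on a made-up data frame, not the project's data. The `category_any` column is a hypothetical variant that marks a row as labeled when any of the three label columns is filled in, rather than `assimilating` alone.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration; NaN marks an article without label values
toy = pd.DataFrame({
    'expanding':    [0.0, np.nan, 1.0, np.nan],
    'distancing':   [1.0, np.nan, 0.0, np.nan],
    'assimilating': [0.0, np.nan, 0.0, np.nan],
})

# Same rule as the loop: 1 if 'assimilating' is present, 0 if it is missing
toy['category'] = toy['assimilating'].notna().astype(int)

# Hypothetical variant: 1 if any of the three label columns is present
label_cols = ['expanding', 'distancing', 'assimilating']
toy['category_any'] = toy[label_cols].notna().any(axis=1).astype(int)

print(toy)
```

If the three label columns are always filled in together, the two columns agree on every row; a mismatch would point to partially labeled articles.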


### Check the balance of classes

Count the articles in each class to see whether class imbalance is a concern for the classifier.

In [5]:
articles['category'].value_counts()

1    574
0    441
Name: category, dtype: int64
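
Raw counts are easier to judge as proportions. As a sketch, the snippet below rebuilds a series with the same counts reported above (574 ones and 441 zeros) and computes each class's share with `value_counts(normalize=True)`.

```python
import pandas as pd

# Rebuild a series with the class counts shown in the output above
category = pd.Series([1] * 574 + [0] * 441, name='category')

# normalize=True returns the share of each class instead of the count
shares = category.value_counts(normalize=True)

# Ratio of the larger class to the smaller one
counts = category.value_counts()
imbalance_ratio = counts.max() / counts.min()

print(shares.round(3))
print(f'majority/minority ratio: {imbalance_ratio:.2f}')
```

With roughly 57% of articles in the larger class, the imbalance here is mild.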

## Export cleaned corpus 

In [6]:
articles.to_csv("/home/jae/ITS-Text-Classification/processed_data/cleaned_text.csv")
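
By default, `to_csv` also writes the row index as an unnamed first column, which appears as `Unnamed: 0` when the file is read back. The sketch below uses a toy frame and a temporary file rather than the project paths. It shows one way to avoid the extra column with `index=False` and to restore the datetime type on reload with `parse_dates`.

```python
import os
import tempfile

import pandas as pd

# Toy frame for illustration only
toy = pd.DataFrame({
    'text': ['first article', 'second article'],
    'date': pd.to_datetime(['1997-10-24', '2001-03-23']),
    'category': [1, 0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'cleaned_text.csv')

    # index=False keeps the row index out of the file
    toy.to_csv(out_path, index=False)

    # parse_dates converts the 'date' column back to datetime64 on load
    reloaded = pd.read_csv(out_path, parse_dates=['date'])

print(reloaded.dtypes)
print(list(reloaded.columns))
```

Whether to drop the index depends on how downstream notebooks read the file, so treat `index=False` as an option rather than a required change.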