# Data Cleaning

## Introduction

Data cleaning is an **important step** in any data science project. It removes irrelevant information so that the following steps run faster and give more reliable results.

In this Jupyter Notebook we will:
1. **Get data**
2. **Clean data**
3. **Organize data**

The output of this notebook is a **corpus** (a collection of texts) and a **document-term matrix** (word counts in matrix format).
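
To make these two outputs concrete, here is a minimal sketch in plain Python. The two sample documents and the names `corpus`, `vocabulary`, and `dtm` are made up for illustration; the notebook itself builds the real matrix later with scikit-learn.

```python
from collections import Counter

# Toy corpus: document name -> text (hypothetical sample data)
corpus = {
    "Wheel": "the wheel turns and the wheel rolls",
    "Radio": "the radio sends radio waves",
}

# Vocabulary: every distinct word in the corpus, sorted for a stable column order
vocabulary = sorted({word for text in corpus.values() for word in text.split()})

# Document-term matrix: one row per document, one count per vocabulary word
dtm = {}
for name, text in corpus.items():
    counts = Counter(text.split())
    dtm[name] = [counts[word] for word in vocabulary]

print(vocabulary)
for name, row in dtm.items():
    print(name, row)
```

Each row has the same length as the vocabulary, so documents of different lengths become directly comparable.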

## 1. Getting the data

**Libraries** used:
- urllib.request
- BeautifulSoup from bs4 (with the lxml parser)
- pickle

I recommend checking these basic tutorials on some of the libraries:

 - [Beautiful Soup Basic Tutorial](https://github.com/gonzalo-cordova-pou/NLP-familiarization/blob/master/Beautiful%20Soup%20Basic%20Tutorial.ipynb)
 - [PyPDF2 - Extract Text from PDF Files](https://github.com/gonzalo-cordova-pou/NLP-familiarization/blob/master/PyPDF2%20-%20Extract%20Text%20from%20PDF%20Files.ipynb)
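
The scraping cell in this section relies on one BeautifulSoup pattern: parse the page, then collect the text of every `<p>` tag. Here is a minimal sketch of that pattern on an inline HTML string. The HTML snippet is invented for illustration, and the built-in `html.parser` is used instead of `lxml` so that no extra parser is needed.

```python
import bs4 as bs

# Invented HTML standing in for a downloaded page
html = """
<html><body>
  <h1>Radio</h1>
  <p>Radio uses <b>radio waves</b>.<sup>[1]</sup></p>
  <div>Navigation menu</div>
  <p>A receiver picks up the signal.</p>
</body></html>
"""

soup = bs.BeautifulSoup(html, "html.parser")

# .text flattens nested tags, so bold text and footnote markers are kept as plain text
paragraphs = [p.text for p in soup.find_all("p")]
print(paragraphs)
```

Only the two paragraphs are returned; the heading and the `<div>` are skipped, and the footnote marker `[1]` survives as text, which is why the cleaning step later removes bracketed references.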

In [1]:
import bs4 as bs
import urllib.request
import pickle

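def url_transcription(url):
    '''Download a web page and return the text of its <p> elements as a list.'''
    html = urllib.request.urlopen(url).read()
    soup = bs.BeautifulSoup(html, 'lxml')  # requires the lxml parser to be installed
    text = [p.text for p in soup.find_all('p')]
    print(url)  # show progress
    return text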


# URLs of Wikipedia info about the inventions
urls = ['https://en.wikipedia.org/wiki/Wheel',
        'https://en.wikipedia.org/wiki/Telephone',
        'https://en.wikipedia.org/wiki/Radio']

# Inventions names
inventions = ['Wheel', 'Telephone', 'Radio']

transcripts = [url_transcription(link) for link in urls]

import os
os.makedirs("information", exist_ok=True)  # new directory; no error if it already exists

# Pickle each transcript to its own file
for name, transcript in zip(inventions, transcripts):
    with open("information/" + name + ".txt", "wb") as file:
        pickle.dump(transcript, file)

# Load pickled files
data = {}
for name in inventions:
    with open("information/" + name + ".txt", "rb") as file:
        data[name] = pickle.load(file)

https://en.wikipedia.org/wiki/Wheel
https://en.wikipedia.org/wiki/Telephone
https://en.wikipedia.org/wiki/Radio


In [2]:
data.keys()

dict_keys(['Wheel', 'Telephone', 'Radio'])

In [3]:
data['Radio'][:2]

['Radio is the technology of signaling and communicating using radio waves.[1][2][3] Radio waves are electromagnetic waves of frequency between 30\xa0hertz (Hz) and 300\xa0gigahertz (GHz). They are generated by an electronic device called a transmitter connected to an antenna which radiates the waves, and received by a radio receiver connected to another antenna. Radio is very widely used in modern technology, in radio communication, radar, radio navigation, remote control, remote sensing and other applications.\n',
 "In radio communication, used in radio and television broadcasting, cell phones, two-way radios, wireless networking and satellite communication among numerous other uses, radio waves are used to carry information across space from a transmitter to a receiver, by modulating the radio signal (impressing an information signal on the radio wave by varying some aspect of the wave) in the transmitter. In radar, used to locate and track objects like aircraft, ships, spacecraft a

## 2. Cleaning the Data

Libraries:
 - pandas
 - re
 - string

### Some basic cleaning techniques

 - Remove numerical values
 - Remove punctuation
 - Make text lower case
 - Remove common words with little meaning
 - Tokenize text
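
The first three techniques are applied with regular expressions in this section, while the last two are left to `CountVectorizer` in section 3. To show what those last two mean, here is a minimal sketch in plain Python. The stop-word list is a tiny hand-picked set for this example only (scikit-learn's English list is much larger), and the helper names are hypothetical.

```python
# Tiny stop-word list, hand-picked for this example only
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}


def tokenize(text):
    """Split already-cleaned text into word tokens on whitespace."""
    return text.split()


def remove_stop_words(tokens):
    """Drop tokens that appear in STOP_WORDS."""
    return [token for token in tokens if token not in STOP_WORDS]


cleaned = "radio is the technology of signaling and communicating"
tokens = tokenize(cleaned)
content_tokens = remove_stop_words(tokens)

print(tokens)
print(content_tokens)
```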

In [4]:
# Combine the list of paragraphs into a single string
def combine_text(list_of_text):
    '''Takes a list of text and combines them into one large chunk of text.'''
    combined_text = ' '.join(list_of_text)
    return combined_text


# Combine it!
data_combined = {key: [combine_text(value)] for (key, value) in data.items()}

In [5]:
import pandas as pd
import re
import string


# We can either keep it in dictionary format or put it into a pandas DataFrame
pd.set_option('display.max_colwidth', 150)

data_df = pd.DataFrame.from_dict(data_combined).transpose()
data_df.columns = ['transcript']
data_df = data_df.sort_index()
data_df

| | transcript |
|---|---|
| Radio | Radio is the technology of signaling and communicating using radio waves.[1][2][3] Radio waves are electromagnetic waves of frequency between 30 h... |
| Telephone | \n A telephone is a telecommunications device that permits two or more users to conduct a conversation when they are too far apart to be heard dir... |
| Wheel | \n In its primitive form, a wheel is a circular block of a hard and durable material at whose center has been bored a hole through which is placed... |


In [6]:

# Let's take a look at a fragment of the transcript for The Wheel
data_df.transcript.loc['Wheel'][:300]

'\n In its primitive form, a wheel is a circular block of a hard and durable material at whose center has been bored a hole through which is placed an axle bearing about which the wheel rotates when torque is applied to the wheel about its axis. The wheel and axle assembly can be considered one of the'

In [7]:
# First round of the cleaning process

def cleanning_process1(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers.'''
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)  # bracketed references such as [1]
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    text = re.sub(r'\w*\d\w*', '', text)  # any token containing a digit
    return text

text_cleaned1 = lambda x: cleanning_process1(x)

data_clean = pd.DataFrame(data_df.transcript.apply(text_cleaned1))
data_clean

| | transcript |
|---|---|
| Radio | radio is the technology of signaling and communicating using radio waves radio waves are electromagnetic waves of frequency between hertz hz and ... |
| Telephone | \n a telephone is a telecommunications device that permits two or more users to conduct a conversation when they are too far apart to be heard dir... |
| Wheel | \n in its primitive form a wheel is a circular block of a hard and durable material at whose center has been bored a hole through which is placed ... |
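
To see what each substitution in the first round contributes, the sketch below applies the same four steps one at a time to a made-up sentence and prints the intermediate results. The sample sentence and the `step` variable names are illustrative only.

```python
import re
import string

sample = "Radio waves[1] range from 30 Hz to 300 GHz!"

step1 = sample.lower()                                              # lowercase
step2 = re.sub(r'\[.*?\]', '', step1)                               # drop [1]-style references
step3 = re.sub('[%s]' % re.escape(string.punctuation), '', step2)  # drop punctuation
step4 = re.sub(r'\w*\d\w*', '', step3)                              # drop tokens containing digits

for label, value in [("lowercase", step1), ("brackets", step2),
                     ("punctuation", step3), ("digits", step4)]:
    print(f"{label:>12}: {value!r}")
```

Removing the numeric tokens leaves double spaces behind. That is harmless here, because the word counting done later splits the text into tokens and ignores the extra whitespace.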


In [8]:
# Second round of the cleaning process

def cleanning_process2(text):
    '''Get rid of some additional punctuation and nonsensical text that was missed the first time around.'''
    text = re.sub('[‘’“”…]', '', text)  # curly quotes and ellipses
    text = re.sub('\n', '', text)  # newline characters
    return text

text_cleaned2 = lambda x: cleanning_process2(x)


# Let's look at the result
data_clean = pd.DataFrame(data_clean.transcript.apply(text_cleaned2))
data_clean

| | transcript |
|---|---|
| Radio | radio is the technology of signaling and communicating using radio waves radio waves are electromagnetic waves of frequency between hertz hz and ... |
| Telephone | a telephone is a telecommunications device that permits two or more users to conduct a conversation when they are too far apart to be heard direc... |
| Wheel | in its primitive form a wheel is a circular block of a hard and durable material at whose center has been bored a hole through which is placed an... |


## 3. Organizing the data

In this last part we want to produce a:
 - **Corpus**
 - **Document-Term Matrix**
 
### Corpus
 
 "In linguistics, a corpus (plural corpora) or text corpus is a language resource consisting of a large and structured set of texts (nowadays usually electronically stored and processed)." - Wikipedia

In [10]:
#Dataframe:

data_df

| | transcript |
|---|---|
| Radio | Radio is the technology of signaling and communicating using radio waves.[1][2][3] Radio waves are electromagnetic waves of frequency between 30 h... |
| Telephone | \n A telephone is a telecommunications device that permits two or more users to conduct a conversation when they are too far apart to be heard dir... |
| Wheel | \n In its primitive form, a wheel is a circular block of a hard and durable material at whose center has been bored a hole through which is placed... |


In [11]:
# We pickle it for later use
data_df.to_pickle("corpus.pkl")

### Document-Term Matrix

Library:
 - sklearn.feature_extraction.text (**CountVectorizer**)
 
 We will use CountVectorizer to build the matrix, dropping English stop words along the way.

In [13]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
data_cv = cv.fit_transform(data_clean.transcript)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
data_dtm.index = data_clean.index
data_dtm

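| | ability | able | absence | absorbed | absorbs | ac | accelerating | acceleration | accomplished | accuracy | ... | yanshi | years | yin | york | yvto | árokalja | édouard | κύκλος | τῆλε | φωνή |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Radio | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | ... | 0 | 2 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| Telephone | 2 | 0 | 1 | 0 | 0 | 5 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 2 | 2 |
| Wheel | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 3 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |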


In [15]:
# Pickle it for later use
data_dtm.to_pickle("dtm.pkl")

In [16]:
# We pickle the cleaned data (before we put it in document-term matrix format) and the CountVectorizer object
data_clean.to_pickle('data_clean.pkl')
with open("cv.pkl", "wb") as file:  # the context manager closes the file after writing
    pickle.dump(cv, file)