## Cleaning text files

To focus on the main keyword (or set of keywords), this notebook goes through the following steps:

1. Open the text files and create a pandas dataframe.
2. Lowercase the text and remove punctuation.
3. Filter rows based on a specific set of keywords.
4. Run a spell check and correct words. This is only useful if it doesn't change specific words like location names.
5. Export the dataframes to CSV for future data exploration.

In [20]:
import os # File manipulation
import pandas as pd # For dataframe operations
import numpy as np
from collections import Counter # To count word occurrences
import re # Regex to remove punctuation from the strings
from autocorrect import Speller # spell checker for an alternative df

In [17]:
# Modify this list to add keywords 🧚
keywords = ['lunatic', 'asylum']

In [3]:
# walk through the /data folder and read text files to make a df
textList = []
for dirname, _, filenames in os.walk('./data'):
    for filename in filenames:
        # print(os.path.join(dirname, filename))
        myfile = os.path.join(dirname, filename)
        with open(myfile, 'rb') as fopen:
            q = fopen.read().decode('ISO-8859-1')
            textList.append(q)
uncleanDf = pd.DataFrame(textList)
uncleanDf.columns = ['text']
uncleanDf

Unnamed: 0,text
0,IP/QB.10 m.91.b. No. 44. (NEW SERIES.) SCIENTI...
1,IP/6/HG.s4. REPORT ON THE CALCUTTA MEDICAL INS...
2,"CHOLERA IN INDIA, 1862 TO 1881. BENGAL PROVINC..."
3,Vol. I 1931 THE Indian Journal of Veterinary S...
4,"IP/QB, 10 m.91.b No. 19. (NEW SERIES.) SCIENTI..."
...,...
463,[NLS note: a graphic appears here - see image ...
464,REPORT ON THE WORKING OF THE MENTAL HOSPITALS ...
465,ICAR. 15. VIII. 650 Vol. VIII 1938 THE Indian ...
466,SLEEPING SICKNESS A SUMMARY OF THE WORK DONE B...
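
The reading loop above decodes every file as ISO-8859-1. That decoding never raises an error, because every byte maps to some character, but it silently garbles any file that is actually UTF-8. We don't know the real encoding of the files in `./data`, so the following is only a sketch of one possible fallback; `decode_text` is a hypothetical helper name, not part of the notebook.

```python
def decode_text(raw_bytes):
    """Try UTF-8 first and fall back to ISO-8859-1 if the bytes are not valid UTF-8."""
    try:
        return raw_bytes.decode('utf-8')
    except UnicodeDecodeError:
        return raw_bytes.decode('ISO-8859-1')

# Made-up sample bytes, only to show the difference between the two encodings
utf8_bytes = 'café'.encode('utf-8')
latin1_bytes = 'café'.encode('ISO-8859-1')

garbled = utf8_bytes.decode('ISO-8859-1')  # UTF-8 bytes read as ISO-8859-1
print(garbled)                    # cafÃ©
print(decode_text(utf8_bytes))    # café
print(decode_text(latin1_bytes))  # café
```

If all the source files really are ISO-8859-1, the original `decode('ISO-8859-1')` call is already fine and this fallback changes nothing.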


In [18]:
# Initial cleaning
def initialCleaning(mystring):
    mystring = mystring.lower() # Text normalization: make string lowercase
    mystring = re.sub(r'[^\w\s]','', mystring) # Text normalization: remove punctuation
    return mystring

def countWords(string, wordsToCount):
    splitString = string.split() # Split string into array of words
    counts = Counter(splitString) # Get counts for each word like Counter({'dogs': 3, 'cute': 1})
    count = 0 # Start the counter
    for word in wordsToCount: # Loop through list of words and add the count
        count = count + counts[word]
    return count
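
To see what these two helpers do, here is a small check on a made-up sentence (the sample string is ours, not from the dataset). The functions are repeated, slightly condensed, so the block runs on its own. Note that the count is an exact token match: plurals like `asylums`, and words fused together once punctuation is removed, are not counted.

```python
import re
from collections import Counter

def initialCleaning(mystring):
    mystring = mystring.lower()                 # make string lowercase
    return re.sub(r'[^\w\s]', '', mystring)     # remove punctuation

def countWords(string, wordsToCount):
    counts = Counter(string.split())            # count every whitespace-separated token
    return sum(counts[word] for word in wordsToCount)

# Made-up sample text for illustration
sample = "The Lunatic Asylum, Madras: asylums' report (lunatic-asylum)."
cleaned = initialCleaning(sample)
total = countWords(cleaned, ['lunatic', 'asylum'])

print(cleaned)  # the lunatic asylum madras asylums report lunaticasylum
print(total)    # 2
```

Only the standalone `lunatic` and `asylum` tokens are counted here, so adding plural forms to `keywords` may be worth considering.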

In [13]:
# .apply() runs initialCleaning on every cell in the text column
cleanDf = pd.DataFrame(uncleanDf.text.apply(initialCleaning))

In [14]:
cleanDf

Unnamed: 0,text
0,ipqb10 m91b no 44 new series scientific memoir...
1,ip6hgs4 report on the calcutta medical institu...
2,cholera in india 1862 to 1881 bengal province ...
3,vol i 1931 the indian journal of veterinary sc...
4,ipqb 10 m91b no 19 new series scientific memoi...
...,...
463,nls note a graphic appears here see image of ...
464,report on the working of the mental hospitals ...
465,icar 15 viii 650 vol viii 1938 the indian jour...
466,sleeping sickness a summary of the work done b...


In [31]:
# Keep only the documents where the keywords occur more than 4 times in total
cleanerDfList = []
for index, row in cleanDf.iterrows():
    count = countWords(row['text'], keywords)
    if count > 4:
        cleanerDfList.append(row['text'])

In [34]:
cleanerDf = pd.DataFrame(cleanerDfList)
cleanerDf.columns = ['text']
cleanerDf

Unnamed: 0,text
0,gtx \r\nsi the 110 phi price\r\nstatistical re...
1,triennial report of the lunatic asylums under ...
2,report on the lunatic asylums under the govern...
3,indian hemp drugs commission vol vi evidence o...
4,leprosy and its control in the bombay presiden...
...,...
60,9 no 1053 a proceedings of the honble the li...
61,report on the working of the micro biological ...
62,annual administration and progress report on t...
63,annual administration and progress report on t...
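
The same keyword filter can also be written without an explicit loop, using a boolean mask. This is only a sketch on a tiny made-up dataframe standing in for `cleanDf`; `demoDf`, `keywordCounts`, `filteredDf` and `min_count` are names we introduce here for illustration.

```python
import pandas as pd
from collections import Counter

keywords = ['lunatic', 'asylum']
min_count = 5  # hypothetical name; same cut-off as `count > 4` in the loop

def countWords(string, wordsToCount):
    counts = Counter(string.split())
    return sum(counts[word] for word in wordsToCount)

# Made-up rows standing in for cleanDf
demoDf = pd.DataFrame({'text': [
    'lunatic asylum report lunatic asylum lunatic',
    'cholera in india',
    'asylum asylum asylum asylum',
]})

# One keyword count per row, then keep the rows that reach the cut-off
keywordCounts = demoDf['text'].apply(lambda text: countWords(text, keywords))
filteredDf = demoDf[keywordCounts >= min_count].reset_index(drop=True)
print(filteredDf)
```

Keeping `keywordCounts` around as its own Series also makes it easy to inspect how the counts are distributed before settling on a threshold.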


In [44]:
!pip install pyspellchecker

Collecting pyspellchecker
  Downloading pyspellchecker-0.5.6-py2.py3-none-any.whl (2.5 MB)
Installing collected packages: pyspellchecker
Successfully installed pyspellchecker-0.5.6


In [50]:
# Speller comes from the autocorrect package imported at the top, not from pyspellchecker
spell = Speller(lang="en")
WORD = re.compile(r'\w+')

def reTokenize(doc):
    tokens = WORD.findall(doc)
    return tokens

def spell_correct(text):
    return ' '.join([spell(w).lower() for w in reTokenize(text)])
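
One way to reduce the risk of the spell checker rewriting location names is to skip a list of protected words. The sketch below is an assumption on our part, not something the notebook does: `spell_correct_protected`, `toy_corrector` and the word lists are hypothetical, and the toy corrector stands in for the real spell checker so the block runs on its own.

```python
import re

WORD = re.compile(r'\w+')

def spell_correct_protected(text, corrector, protected):
    """Correct each token with `corrector`, leaving tokens in `protected` untouched."""
    tokens = WORD.findall(text)
    return ' '.join(
        token if token in protected else corrector(token).lower()
        for token in tokens
    )

# Toy stand-in for a spell checker: it fixes one typo but also mangles a place name
toy_fixes = {'asylm': 'asylum', 'colaba': 'cola'}

def toy_corrector(word):
    return toy_fixes.get(word, word)

unprotected = spell_correct_protected('asylm at colaba', toy_corrector, set())
protected = spell_correct_protected('asylm at colaba', toy_corrector, {'colaba'})

print(unprotected)  # asylum at cola
print(protected)    # asylum at colaba
```

In the notebook, `spell` could be passed as the `corrector` argument, since it is already called as `spell(w)` on single words; the protected set would have to be built by hand from known place names and local terms.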

In [None]:
# .apply() runs spell_correct on every cell in the text column
scDf = pd.DataFrame(cleanerDf.text.apply(spell_correct))

In [None]:
scDf

## Cleaned Exports

We have two cleaned dataframes here:

1. `cleanerDf` is the clean dataframe. It includes only the documents in which the main keywords occur more than four times in total, with all text lowercased and punctuation removed.
2. `scDf` is the same as `cleanerDf`, except that the spell checker has corrected spelling mistakes. NOTE: USE WITH CAUTION. THIS DATAFRAME MIGHT CONTAIN WRONG LOCATION NAMES AND LOCAL TERMINOLOGY.

In [None]:
scDf.to_csv('df-spellcorrected.csv') # Not recommended for general use, but helpful when words don't make sense 🧚
cleanerDf.to_csv('df.csv')