### Cleaning text files with a focus on Mental Health in British India
To focus on the main keyword (or set of keywords), this notebook goes through the following steps:
- Open the text files and create a pandas dataframe.
- Lowercase the text and remove punctuation.
- Filter rows based on a specific set of keywords.
- Optionally run a spell check and correct words. This is only useful if it leaves specific words, such as location names, unchanged.
- Export the dataframe to csv for future data exploration.
- Run LDA topic modelling on the filtered documents.

In [1]:
# Import libraries
import os # File manipulation, to access files for cleaning
import re # Regex to remove punctuation from strings I split
from collections import Counter # To count word occurrences
from shutil import copyfile # For copying clean files
#from autocorrect import Speller # spell checker for an alternative df

import numpy as np # For dataframe analysis
import pandas as pd # For dataframe operations
import scipy.sparse # Sparse matrices for the gensim corpus
import matplotlib.pyplot as plt # For graphs
import seaborn as sns # For graphs

# We need NLTK and Gensim for LDA topic modelling
from nltk import word_tokenize, pos_tag

# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')

import gensim
from gensim import corpora, matutils, models

from sklearn.feature_extraction.text import CountVectorizer # For creating document-term matrix & excluding stop words
from sklearn.feature_extraction import text # For getting stop words
from wordcloud import WordCloud # For creating word clouds
from textblob import TextBlob # For sentiment analysis
%matplotlib inline

In [2]:
# Modify this list to add keywords 🧚
keywords = ['lunatic', 'asylum', 'mental', 'hospital'] # 'mental' and 'hospital' are included alongside the asylum terms

In [3]:
pwd

'/Users/tamaralottering/Desktop/british-india-papers'

In [4]:
# Walk through the nls-text-indiaPapers folder and read every text file into a df
textList = []
for dirname, _, filenames in os.walk('/Users/tamaralottering/Desktop/nls-text-indiaPapers'):
    for filename in filenames:
        # print(os.path.join(dirname, filename))
        myfile = os.path.join(dirname, filename)
        with open(myfile, 'rb') as fopen:
            q = fopen.read().decode('ISO-8859-1')
            textList.append(q)
uncleanDf = pd.DataFrame(textList)
uncleanDf.columns = ['text']
uncleanDf

Unnamed: 0,text
0,IP/QB.10 m.91.b. No. 44. (NEW SERIES.) SCIENTI...
1,IP/6/HG.s4. REPORT ON THE CALCUTTA MEDICAL INS...
2,"CHOLERA IN INDIA, 1862 TO 1881. BENGAL PROVINC..."
3,Vol. I 1931 THE Indian Journal of Veterinary S...
4,"IP/QB, 10 m.91.b No. 19. (NEW SERIES.) SCIENTI..."
...,...
465,[NLS note: a graphic appears here - see image ...
466,REPORT ON THE WORKING OF THE MENTAL HOSPITALS ...
467,ICAR. 15. VIII. 650 Vol. VIII 1938 THE Indian ...
468,SLEEPING SICKNESS A SUMMARY OF THE WORK DONE B...


In [5]:
# Initial cleaning
def initialCleaning(mystring):
    mystring = mystring.lower() # Text normalization: make string lowercase
    mystring = re.sub(r'[^\w\s]', '', mystring) # Text normalization: remove punctuation
    # Note: punctuation is already gone at this point, so the bracket and link
    # patterns on the next two lines no longer find anything to match.
    mystring = re.sub(r'\[.*?\]', '', mystring) # Text normalization: remove text in square brackets
    mystring = re.sub(r'https?://\S+|www\.\S+', '', mystring) # Text normalization: remove links
    mystring = re.sub(r'\n', '', mystring) # Text normalization: remove linebreaks (carriage returns \r are kept)
    mystring = re.sub(r'\w*\d\w*', '', mystring) # Text normalization: remove words that contain digits
    return mystring

def countWords(string, wordsToCount):
    splitString = string.split() # Split string into array of words
    counts = Counter(splitString) # Get counts for each word like Counter({'dogs': 3, 'cute': 1})
    count = 0 # Start the counter
    for word in wordsToCount: # Loop through list of words and add the count
        count = count + counts[word]
    return count
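
In `initialCleaning` the punctuation is stripped first, so by the time the square-bracket and link patterns run there is no `[`, `.` or `/` left for them to match (the cleaned output still contains 'nls note a graphic appears here'). A minimal sketch of a reordered variant is shown here. The name `reorderedCleaning`, the sample string and the choice to collapse all whitespace to single spaces are assumptions for illustration, not part of the original notebook, and using it would change the outputs shown in the cells that follow.

```python
import re

def reorderedCleaning(mystring):
    """Hypothetical variant of initialCleaning with the steps reordered."""
    mystring = mystring.lower()                                 # make string lowercase
    mystring = re.sub(r'\[.*?\]', ' ', mystring)                # remove text in square brackets first
    mystring = re.sub(r'https?://\S+|www\.\S+', ' ', mystring)  # remove links while dots and slashes still exist
    mystring = re.sub(r'[^\w\s]', '', mystring)                 # now remove punctuation
    mystring = re.sub(r'\w*\d\w*', '', mystring)                # remove words that contain digits
    mystring = re.sub(r'\s+', ' ', mystring).strip()            # assumption: collapse \r, \n and runs of spaces
    return mystring

# Made-up sample in the style of the source files
sample = "[NLS note: a graphic appears here]\r\nREPORT No. 44, see www.example.org for\ndetails."
cleaned = reorderedCleaning(sample)
print(cleaned)
```

Replacing line breaks with a space, rather than deleting them, also avoids joining the last word of one line to the first word of the next.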

In [6]:
cleanText = lambda text: initialCleaning(text) # Thin wrapper around initialCleaning; .apply() below runs it on every cell in the column

In [7]:
# Tokenize and Lemmatize the text (NLP standardisation methods)
import nltk

w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]

In [8]:
cleanDf = pd.DataFrame(uncleanDf.text.apply(cleanText)) # .apply() the function to all cells

In [9]:
cleanDf

Unnamed: 0,text
0,no new series scientific memoirs by officer...
1,report on the calcutta medical institutions f...
2,cholera in india to bengal province to and...
3,vol i the indian journal of veterinary scienc...
4,ipqb no new series scientific memoirs by of...
...,...
465,nls note a graphic appears here see image of ...
466,report on the working of the mental hospitals ...
467,icar viii vol viii the indian journal of ve...
468,sleeping sickness a summary of the work done b...


In [10]:
# Keep only the documents in which the keywords occur more than four times in total
cleanerDfList = []
for docText in cleanDf['text']:
    count = countWords(docText, keywords)
    if count > 4:
        cleanerDfList.append(docText)
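
Because `countWords` compares whole tokens, a keyword such as 'hospital' does not count 'hospitals', and 'asylum' does not count 'asylums'. The sketch below contrasts the exact count with a prefix-based count. `countKeywordVariants` and the sample sentence are hypothetical and only illustrate the difference; prefix matching also picks up unrelated words such as 'hospitality', so it would need checking against the corpus before use.

```python
import re
from collections import Counter

keywords = ['lunatic', 'asylum', 'mental', 'hospital']

def countKeywordVariants(docText, stems):
    """Hypothetical helper: count tokens that start with any of the stems."""
    pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, stems)) + r')\w*')
    return len(pattern.findall(docText))

# Made-up sample sentence in the style of the cleaned text
sample = "report on the mental hospitals and lunatic asylums of the province"

tokenCounts = Counter(sample.split())
exactCount = sum(tokenCounts[word] for word in keywords)  # whole-token matches only
variantCount = countKeywordVariants(sample, keywords)     # also counts plural forms

print(exactCount, variantCount)
```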

In [11]:
cleanerDf = pd.DataFrame(cleanerDfList)
cleanerDf.columns = ['text']
cleanerDf

Unnamed: 0,text
0,report on the calcutta medical institutions f...
1,cholera in india to bengal province to and...
2,vol i the indian journal of veterinary scienc...
3,ipqb no new series scientific memoirs by of...
4,gtx \rsi the phi price\rstatistical returns o...
...,...
279,\rnls note a graphic appears here see image o...
280,nls note a graphic appears here see image of ...
281,nls note a graphic appears here see image of ...
282,report on the working of the mental hospitals ...


#### Cleaned Export
- `cleanerDf` is the cleaned dataframe: it holds every document in which the keywords occur more than four times in total, with all text in lowercase and punctuation removed.
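
The export step itself could look like the sketch below. It uses a small stand-in dataframe and a temporary folder so that it runs on its own; the file name `cleaner_df_example.csv` and the name `exampleDf` are assumptions, and in the notebook the call would be made on `cleanerDf` with a path of your choice.

```python
import os
import tempfile

import pandas as pd

# Stand-in for cleanerDf: one column named 'text', one row per document
exampleDf = pd.DataFrame({'text': ['report on the lunatic asylums',
                                   'mental hospitals of bengal']})

# Hypothetical output location inside a fresh temporary folder
outPath = os.path.join(tempfile.mkdtemp(), 'cleaner_df_example.csv')

exampleDf.to_csv(outPath, index=False)  # index=False leaves the row index out of the file
reloaded = pd.read_csv(outPath)         # read it back to check the round trip
print(reloaded)
```

Writing with `index=False` means the reloaded file has only the `text` column, rather than an extra `Unnamed: 0` column holding the old row index.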

### Topic Modelling

- Topic modelling automatically identifies the topics present in a text corpus and surfaces hidden patterns in it.

- It is an unsupervised machine learning approach that finds groups of words (called “topics”) in large collections of text.

- A topic can be defined as “a repeating pattern of co-occurring terms in a corpus”. A good topic model should return terms such as “health”, “doctor”, “patient” and “hospital” for a healthcare topic, and “farm”, “crops” and “wheat” for a farming topic.

- Topic models are useful for document clustering, organizing large blocks of textual data, information retrieval from unstructured text, and feature selection.

Resources:
https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0

In [12]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/tamaralottering/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/tamaralottering/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [13]:
# Tokenize the text and keep only nouns and adjectives (POS tagging)
def partsOfSpeechFilter(text):
    isNounAdj = lambda pos: pos[:2] == 'NN' or pos[:2] == 'JJ'
    tokenized = word_tokenize(text)
    nounsAdj = [word for (word, pos) in pos_tag(tokenized) if isNounAdj(pos)] 
    return ' '.join(nounsAdj)

#### 1. Preparing Document-Term Matrix

To run a mathematical model on a text corpus, the text first has to be converted into a matrix representation. The LDA model looks for repeating term patterns across the entire document-term (DT) matrix.

- The document-term matrix is built with scikit-learn's `CountVectorizer`.
- “gensim” runs the topic model, so the matrix is converted into a gensim corpus before the model is fitted.
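
As a small illustration of what a document-term matrix holds, the sketch below builds one by hand with the standard library only. The three toy documents are made up, and `CountVectorizer` additionally applies its own tokenisation and stop-word removal, so this is a simplified picture rather than a replacement for it.

```python
from collections import Counter

# Three toy documents (assumed examples, not taken from the corpus)
docs = ["asylum patients asylum",
        "cattle disease",
        "patients disease hospital"]

# The vocabulary is every distinct term, in alphabetical order
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per term, each cell a term count
dtm = [[Counter(doc.split())[term] for term in vocab] for doc in docs]

print(vocab)
for row in dtm:
    print(row)
```

Each row is a document and each column a term, which is the same layout as `dataDtmNA` in the notebook.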

In [14]:
# Apply the parts-of-speech filter to keep only nouns and adjectives
dfMH = pd.DataFrame(cleanerDf.text.apply(partsOfSpeechFilter)) 

In [15]:
# CountVectorizer converts a collection of text documents into a matrix of term/token counts.
# Get the document-term matrix
vectorizer = CountVectorizer(stop_words='english')
dataVectorizer = vectorizer.fit_transform(dfMH.text) # Learn the vocabulary and count terms per document
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
dataDtmNA = pd.DataFrame(dataVectorizer.toarray(), columns = vectorizer.get_feature_names_out())
dataDtmNA # This is the document-term matrix

Unnamed: 0,__,___,____,_____,______,_______,________,_________,__________,___________,...,être,ídar,îstrus,óf,ôkpho,öocysts,öotype,ùpon,únmádmadness,über
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,12,5,5,5,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
279,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
280,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
281,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
282,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [16]:
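# Convert the document-term matrix to a gensim corpus
# (Sparse2Corpus expects terms as rows and documents as columns by default, hence the transpose)
corpusNA = matutils.Sparse2Corpus(scipy.sparse.csr_matrix(dataDtmNA.transpose()))
# Map ID to words
id2wordNA = dict((v, k) for k, v in vectorizer.vocabulary_.items())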

#### 2. Running LDA Model

In [17]:
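# Run the LDA model with 5 topics and 80 passes over the corpus
# No random_state is set, so the topics can differ from run to run
lda = models.LdaModel(corpus=corpusNA, num_topics=5, id2word=id2wordNA, passes=80)
lda.print_topics()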

[(0,
  '0.013*"plague" + 0.011*"cases" + 0.009*"hospital" + 0.009*"medical" + 0.006*"disease" + 0.006*"case" + 0.006*"deaths" + 0.006*"number" + 0.006*"officer" + 0.005*"district"'),
 (1,
  '0.039*"total" + 0.025*"year" + 0.016*"number" + 0.014*"males" + 0.013*"females" + 0.010*"vaccination" + 0.010*"statement" + 0.008*"average" + 0.008*"asylum" + 0.008*"years"'),
 (2,
  '0.007*"cent" + 0.006*"animals" + 0.005*"milk" + 0.005*"table" + 0.005*"disease" + 0.004*"animal" + 0.004*"blood" + 0.004*"indian" + 0.004*"cattle" + 0.004*"results"'),
 (3,
  '0.029*"ganja" + 0.019*"bhang" + 0.015*"use" + 0.013*"drugs" + 0.012*"charas" + 0.010*"hemp" + 0.007*"drug" + 0.007*"district" + 0.006*"people" + 0.006*"plant"'),
 (4,
  '0.028*"year" + 0.022*"veterinary" + 0.018*"total" + 0.018*"number" + 0.010*"rs" + 0.009*"district" + 0.009*"department" + 0.009*"animals" + 0.009*"government" + 0.008*"table"')]
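
To read the topics more easily, the strings returned by `print_topics()` can be split into (term, weight) pairs. The sketch below does this with a regular expression on a shortened copy of topic 3 from the output above; `parseTopic` is a hypothetical helper written for this illustration, not a gensim function.

```python
import re

def parseTopic(topicString):
    """Hypothetical helper: turn '0.029*"ganja" + ...' into (term, weight) pairs."""
    pairs = re.findall(r'([\d.]+)\*"([^"]+)"', topicString)
    return [(term, float(weight)) for weight, term in pairs]

# First five terms of topic 3, copied from the output above
topic3 = '0.029*"ganja" + 0.019*"bhang" + 0.015*"use" + 0.013*"drugs" + 0.012*"charas"'

parsed = parseTopic(topic3)
for term, weight in parsed:
    print(f"{term:<8} {weight:.3f}")
```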

### Topics - 'lunatic asylums' corpus only

- Mental asylum in Tezpur and Burma
- Use of ganja, bhang and charas by people in the district: ganja mania
- Reports of total deaths and cases of disease among patients
- Males and females in lunatic asylums

### Topics - 'lunatic asylums' and 'mental health' corpus

- Use of ganja, bhang and charas by people in the district: ganja mania
- Total vaccinations per year of males and females in asylums

Unrelated
- Number of plague cases and deaths in hospitals
- Diseases in animals (e.g. cattle)
