# Final Project

## Framing

**Introduction**: describe your dataset and why you're interested in it

**Research question(s)**: describe the overall research question of your project

**Hypotheses**:
    * Describe 2-3 hypotheses that you're planning to test with your dataset
    * Each hypothesis should be based on academic research (cite a paper) and/or background knowledge that you have about the dataset if you've collected it yourself (e.g., if you've conducted interviews)
    * Each hypothesis should be formulated as a statement (and not a question)
    * You can also describe alternative hypotheses if you think that your results could go either way (but again, give a rationale for why)

**Results**:
    * How are you planning to test each hypothesis? What models are you thinking of using?
    * What are the best results you can hope for? Would they be interesting or relevant for other researchers?
    * What are the implications of your potential findings for practitioners?

**Threats**:
    * Describe issues that might arise during the analyses above
    * Come up with backup plans in case you run into these issues

## Data Exploration

Describe your raw data below; provide definitions and explanations for the measures you're using
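
A quick way to get a first description of a raw CSV file is to count its rows, check how many fields each row has, and look at the first few rows. The sketch below is only an illustration: the function name `summarize_csv` and its parameters are assumptions, and it makes no claim about what the columns of the dataset actually contain.

```python
import csv
from collections import Counter


def summarize_csv(csv_path, preview_rows=3):
    """Return (row_count, fields_per_row, first_rows) for a CSV file.

    Hypothetical helper for a first look at raw data; not part of the
    original notebook.
    """
    fields_per_row = Counter()
    first_rows = []
    with open(csv_path, "r", encoding="utf-8", errors="ignore", newline="") as f:
        for i, row in enumerate(csv.reader(f)):
            fields_per_row[len(row)] += 1  # how many rows have this many fields
            if i < preview_rows:
                first_rows.append(row)
    return sum(fields_per_row.values()), dict(fields_per_row), first_rows
```

If `fields_per_row` contains more than one key, some rows are probably malformed and worth inspecting before cleaning.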

## Data Cleaning

Clean your data in this section, and make sure it's ready to be analyzed for next week!

### Step 1 - Data Retrieval

In [34]:
import csv
import os
import shutil

In [35]:
# (re)creates an empty folder for the text files; an existing folder is
# deleted first so that re-running the notebook does not append duplicates

path = '/Users/erincarvalho/Desktop/dev/final-project-Erin-c'
if os.path.isdir(path + '/txt_files'):
    shutil.rmtree(path + '/txt_files')
os.mkdir(path + '/txt_files')

In [36]:
# creates a separate file for each topic with all posts and replies

ids = set()   # topic ids seen so far
count = 0     # number of distinct topics

with open('ScratchEd_all_data.csv', "r", encoding='utf-8', errors='ignore') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')

    for row in csv_reader:
        topic_id = str(row[0])
        filename = path + '/txt_files/topic_' + topic_id + '.txt'
        # append mode creates the file for the first post of a topic
        # and adds every later reply to the same file
        with open(filename, 'a+', encoding='utf-8') as file:
            file.write(str(row[3]) + '\r\n' + '\r\n')
        if topic_id not in ids:
            ids.add(topic_id)
            count += 1

print(count)
print(len(ids))

1444
1444
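
An alternative to reopening a file for every row is to group the posts by topic in memory first and then write each file once. The sketch below assumes, like the cell above, that the topic ID is in column 0 and the post text in column 3. The function name `write_topic_files` and its parameters are illustrative, not part of the original notebook.

```python
import csv
import os
from collections import defaultdict


def write_topic_files(csv_path, out_dir, id_col=0, text_col=3):
    """Group post texts by topic ID, then write one text file per topic.

    Returns the number of topics written. id_col and text_col mirror the
    row[0] / row[3] indexing used above (an assumption about the CSV layout).
    """
    posts_by_topic = defaultdict(list)
    with open(csv_path, "r", encoding="utf-8", errors="ignore", newline="") as csv_file:
        for row in csv.reader(csv_file, delimiter=","):
            if len(row) <= max(id_col, text_col):
                continue  # skip blank or truncated rows
            posts_by_topic[row[id_col]].append(row[text_col])

    os.makedirs(out_dir, exist_ok=True)
    for topic_id, posts in posts_by_topic.items():
        filename = os.path.join(out_dir, "topic_" + topic_id + ".txt")
        # newline="" keeps the explicit \r\n separators unchanged on every OS
        with open(filename, "w", encoding="utf-8", newline="") as f:
            f.write("".join(post + "\r\n\r\n" for post in posts))
    return len(posts_by_topic)
```

Because each file is opened in write mode, re-running this version overwrites earlier output instead of appending to it.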


In [37]:
import glob

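# save the paths of all the text files in a list
# (built from `path` so it works regardless of the current working directory)

threads = glob.glob(os.path.join(path, 'txt_files', '*.txt'))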
print(len(threads))

1444
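
`glob.glob` returns files in an arbitrary, system-dependent order, so `documents[0]` may not be the same thread on every machine. If a stable order matters, sorting the paths is one option. The helper name `list_thread_files` below is an assumption made for illustration.

```python
import glob
import os


def list_thread_files(folder):
    """Return the paths of all .txt files in `folder`, sorted for a stable order."""
    return sorted(glob.glob(os.path.join(folder, "*.txt")))
```

The sort is alphabetical on the full path, so `topic_10.txt` comes before `topic_2.txt`; that is enough for reproducibility, though not a numeric ordering.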


In [38]:
documents = []

# load the text of each thread into a list, converted to lowercase
for thread in threads:
    with open(thread, "r", encoding='utf-8', errors='ignore') as t:
        documents.append(t.read().lower())


### Step 2 - Data Cleaning

In [40]:
# replaces newlines and carriage returns with spaces

documents = [doc.replace('\n', ' ') for doc in documents]
documents = [doc.replace('\r', ' ') for doc in documents]

may 2015 cambridge scratch educator meetup attende
does anyone know of an online course for credit ab
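
Replacing each `\r` and `\n` with a space leaves runs of several spaces wherever posts were separated by blank lines. If single-spaced text is wanted at this stage, one possible approach (a sketch, with the hypothetical helper name `normalize_whitespace`) is to split on whitespace and rejoin:

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace (spaces, tabs, line breaks) into one space."""
    # str.split() with no argument splits on any whitespace and drops empty pieces
    return " ".join(text.split())
```

It could be applied as `documents = [normalize_whitespace(doc) for doc in documents]`.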


In [42]:
# duplicate entries and multi-character entries ('...', '",', '".') are left out:
# their individual characters are already in the list
punctuation = ['.', '!', '#', '"', '%', '$', "'", '&', ')',
               '(', '+', '*', '-', ',', '/', ';', ':', '=',
               '<', '?', '>', '@', '[', ']', '\\',
               '_', '^', '`', '{', '}', '|', '~', '−', '”', '“', '’']


# replaces punctuation with spaces

for i,doc in enumerate(documents): 
    for punc in punctuation: 
        doc = doc.replace(punc, ' ')
    documents[i] = doc
    
print(documents[0][:100])

may 2015 cambridge scratch educator meetup attendees  andrea blake and family steven connelly janet 
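
The same substitution can be done in a single pass per document with `str.translate`. This is a sketch under the assumption that the characters to remove are the ASCII punctuation in `string.punctuation` plus the four non-ASCII marks from the list above. The names `PUNCTUATION`, `PUNCT_TABLE` and `strip_punctuation` are illustrative.

```python
import string

# ASCII punctuation plus the non-ASCII marks listed in the cell above
PUNCTUATION = string.punctuation + "−”“’"

# translation table mapping each punctuation character to a space
PUNCT_TABLE = str.maketrans(PUNCTUATION, " " * len(PUNCTUATION))


def strip_punctuation(text):
    """Replace every punctuation character in `text` with a space."""
    return text.translate(PUNCT_TABLE)
```

A translation table scans each document once, whereas the nested loop scans it once per punctuation mark.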


In [43]:
# removes numbers

for i,doc in enumerate(documents): 
    for num in range(10):
        doc = doc.replace(str(num), '')
    documents[i] = doc

print(documents[0][:100])

may  cambridge scratch educator meetup attendees  andrea blake and family steven connelly janet dee 
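
Digits can also be removed with one regular expression instead of ten `replace` calls per document. The sketch below matches only the ASCII digits 0-9, as the loop above does. `DIGITS` and `strip_digits` are hypothetical names.

```python
import re

# matches a single ASCII digit; [0-9] is used instead of \d so that
# non-ASCII digit characters are left alone, as in the loop above
DIGITS = re.compile(r"[0-9]")


def strip_digits(text):
    """Delete every ASCII digit from `text`."""
    return DIGITS.sub("", text)
```

Note that deleting digits outright joins the surrounding characters, so a token such as `scratch2` becomes `scratch`.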


In [44]:
stop_words = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 
              'ourselves', 'you', 'your', 'yours', 'yourself', 
              'yourselves', 'he', 'him', 'his', 'himself', 'she', 
              'her', 'hers', 'herself', 'it', 'its', 'itself', 
              'they', 'them', 'their', 'theirs', 'themselves', 
              'what', 'which', 'who', 'whom', 'this', 'that', 
              'these', 'those', 'am', 'is', 'are', 'was', 'were', 
              'be', 'been', 'being', 'have', 'has', 'had', 'having', 
              'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 
              'but', 'if', 'or', 'because', 'as', 'until', 'while', 
              'of', 'at', 'by', 'for', 'with', 'about', 'against', 
              'between', 'into', 'through', 'during', 'before', 
              'after', 'above', 'below', 'to', 'from', 'up', 'down', 
              'in', 'out', 'on', 'off', 'over', 'under', 'again', 
              'further', 'then', 'once', 'here', 'there', 'when', 
              'where', 'why', 'how', 'all', 'any', 'both', 'each', 
              'few', 'more', 'most', 'other', 'some', 'such', 'no', 
              'nor', 'not', 'only', 'own', 'same', 'so', 'than', 
              'too', 'very', 's', 't', 'can', 'will', 
              'just', 'don', 'should', 'now']


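# removes stop words that are surrounded by spaces
# (a stop word at the very start or end of a document, or one repeated
# back to back such as "and and", is not fully removed by this approach)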
for i,doc in enumerate(documents):
    for stop_word in stop_words:
        doc = doc.replace(' ' + stop_word + ' ', ' ')
    documents[i] = doc

print(documents[0][:100])

may  cambridge scratch educator meetup attendees  andrea blake family steven connelly janet dee jing
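
Matching `' ' + stop_word + ' '` as a substring can miss some stop words: one at the very start or end of a document has no space on one side, and in `"cats and and dogs"` the two matches would have to share a space, so only the first `and` is removed. Filtering whole tokens avoids both cases. The sketch below is one way to do it, and the function name `remove_stop_words` is an assumption.

```python
def remove_stop_words(text, stop_words):
    """Drop every whitespace-separated token of `text` that is a stop word."""
    stop_set = set(stop_words)  # set membership tests are fast
    return " ".join(word for word in text.split() if word not in stop_set)
```

It could be applied as `documents = [remove_stop_words(doc, stop_words) for doc in documents]`. Unlike the loop above, it also collapses repeated spaces, because the tokens are rejoined with single spaces.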


In [None]:
# removes words with one and two characters (e.g., 'd', 'er', etc.)

for i,doc in enumerate(documents):  
    doc = [x for x in doc.split() if len(x) > 2]
    doc = " ".join(doc)
    documents[i] = doc

print(documents[0][:100])
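
One way to check whether the cleaned text is ready for analysis is to look at the most frequent tokens across all documents: leftover stop words, markup fragments, or stray characters tend to show up near the top. This is only a sanity-check sketch, and `most_common_tokens` is a hypothetical helper name.

```python
from collections import Counter


def most_common_tokens(docs, n=20):
    """Return the `n` most frequent whitespace-separated tokens across `docs`."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.split())
    return counts.most_common(n)
```

For example, `most_common_tokens(documents)` returns a list of `(token, count)` pairs that can be scanned by eye.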