## Data Intake, Cleaning, and Enrichment Steps
Below is a general summary of the data intake, cleaning, and enrichment steps, followed by the code that was used. **DO NOT** run any cells: some are configured to read or write files relative to your current working directory and may overwrite correct data files. In practice, much of this code lived in separate notebooks so that work could progress in stages and the group could give feedback along the way. As a result, many of the cells below create temporary files that were used while we discussed changes and needs.

**Person Details:** https://dreambank.net/grid.cgi

![details](img/details.png)

1. download the html into a tree based data structure
2. remove elements that aren't part of the table
3. iterate through the elements in the table
4. build a dataframe with the elements and a bit of cleaning
5. enrich with desirable columns and associations
6. save to csv
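
As a rough illustration of steps 1 to 4, the sketch below pulls table rows out of an HTML string. It is not the code that was actually used: it relies on the standard library's event-based `HTMLParser` instead of a tree-based structure, and `TableRows`, `sample_html`, and the cell values are made up for the example.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # cells of the row being read (None outside <tr>)
        self._cell = None  # text pieces of the cell being read (None outside a cell)

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th') and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a table cell is ignored (step 2).
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical stand-in for the downloaded page; the values are made up.
sample_html = """
<p>text outside the table is ignored</p>
<table>
  <tr><th>name</th><th>sex</th><th>years</th><th>dreams</th></tr>
  <tr><td>Example Person</td><td>F</td><td>2015</td><td>12</td></tr>
</table>
"""

parser = TableRows()
parser.feed(sample_html)
header, *records = parser.rows
```

The `header` and `records` lists can then be handed to a dataframe constructor for the cleaning and enrichment steps.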

![details_fin](img/details_fin.png)

**Dream Journals** https://dreambank.net/random_sample.cgi

![dreams](img/dreams.png)

1. take control of a browser with python
2. iteratively
    * select each person in the drop down (the text contains the number of samples)
    * copy that person's number of samples (the number of dreams recorded)
    * paste it in how many dreams to find (meaning all of them)
    * clear the minimum and maximum word counts (to get all words)
    * press search

![diaries](img/diaries.png)

2. cont.
    * copy all content into a string
    * split into 2 arrays of strings: (number and date) and (dream content)
    * combine into a dataframe
    * clean it
    * enrich with the VADER sentiment
    * repeat for all 94 people
3. save dataframes to csv files

![files](img/files.png)

![struct](img/struct.png)

**Helper Datasets:** positive, neutral, negative, and compound descriptive summaries
1. Iteratively apply the pandas `DataFrame.describe()` method to subsets of the diaries.
2. Accumulate into a dataframe.
3. Add diary association.
4. Write to csv
5. Repeat for each sentiment type.
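
To make the shape of these summary files concrete, here is a minimal standard-library sketch that computes the same eight statistics for each diary and tags each row with its diary number. The function name `describe`, the `diaries` dictionary, and the scores in it are made up for illustration. The sample standard deviation and the `inclusive` quantile method are chosen because they should correspond to the defaults of `describe()`.

```python
import statistics


def describe(values):
    """Return count, mean, std, min, quartiles, and max for a list of numbers."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method='inclusive')
    return {
        'count': len(values),
        'mean': statistics.fmean(values),
        'std': statistics.stdev(values),  # sample standard deviation
        'min': min(values),
        '25%': q1,
        '50%': q2,
        '75%': q3,
        'max': max(values),
    }


# Made-up compound scores for two hypothetical diaries.
diaries = {1: [0.1, 0.2, 0.3, 0.4], 2: [-0.5, 0.0, 0.5]}

# One summary row per diary, with the diary association added (step 3).
summary_rows = [
    {'diary': diary, **describe(scores)}
    for diary, scores in diaries.items()
]
```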

`neutral_summary.csv` (example below), `positive_summary.csv`, `negative_summary.csv`, `compound_summary.csv`

![summary](img/summary.png)

---

## Initial Intake Notebook

Detailed description of each person:

https://dreambank.net/grid.cgi

Some ways this data has been used:

https://dreams.ucsc.edu/Library/domhoff_2008c.html

Goal: Get all of the available dreams from dreambank.net
Plan:
- [ ] Scrape the data from https://dreambank.net/random_sample.cgi
- [ ] Scrape the descriptions from https://dreambank.net/grid.cgi
- [ ] Compile into a single csv if possible.
- [ ] Construct a useable data dictionary.


#### Stage 1 Scrape:
1. go to https://dreambank.net/random_sample.cgi
2. find __select__ tag with __name="series"__ and __id="select:series"__ attributes.
3. for each __option__ tag within the select tag, __focus__ each option and get
    1. sample size, formatted as `[n=321]`
4. find __input__ tag with __name="min"__ attribute and clear it
5. find __input__ tag with __name="max"__ attribute and clear it
6. find __input__ tag with __name="n"__ attribute and replace with the sample size found from above.
7. find __input__ tag with __type="submit"__ and __value="Search"__ attributes and submit.
8. copy everything within the body tag into its own text file.
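
The sample size in step 3 can be read from the option text with a regular expression. This is only a sketch: the helper name `sample_size` and the example label are hypothetical, and only the `[n=321]` format comes from the page.

```python
import re


def sample_size(option_text):
    """Return the integer inside a trailing '[n=...]' marker."""
    match = re.search(r'\[n=(\d+)\]', option_text)
    if match is None:
        raise ValueError(f'no sample size found in {option_text!r}')
    return int(match.group(1))


# Hypothetical option label; only the '[n=321]' part follows the real format.
example_size = sample_size('some dream series [n=321]')
```

Raising an error on a label without the marker makes a placeholder option fail loudly rather than producing an empty search.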

In [None]:
# !pip3 install selenium

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

import numpy as np
import pandas as pd

import re

In [None]:
url = "https://dreambank.net/random_sample.cgi"
driver = webdriver.Chrome()
driver.get(url)

In [None]:
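number_of_options = 95
# visit option indices 1 through 94; index 0 is skipped
for option_number in range(1, number_of_options):
    dropdown = Select(driver.find_element(By.ID, 'select:series'))
    dropdown.select_by_index(option_number)
    # the option text ends with '[n=321]'; keep what sits between '=' and ']'
    n = dropdown.options[option_number].text.split('=')[1][:-1]
    # clear the word-count limits so that dreams of every length are returned
    driver.find_element(By.NAME, 'min').clear()
    driver.find_element(By.NAME, 'max').clear()
    # request every dream in the series
    driver.find_element(By.NAME, 'n').clear()
    driver.find_element(By.NAME, 'n').send_keys(n)
    # submit() on an input element submits the form that contains it
    driver.find_element(By.TAG_NAME, 'input').submit()
    with open(f'raw_diary_files/{option_number}.txt', 'w') as file:
        file.write(driver.find_element(By.TAG_NAME, 'body').text)
    # go back to the search form for the next series
    driver.back()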

#### Stage 2 Scrape

https://dreambank.net/grid.cgi

Initial Format:

name, sex, years, number_of_dreams, description

selectors to use:
* table tag, columns 0, 3, 4, 5

Steps Used:
1. manually copied and pasted descriptions and table into two separate files.
2.

In [None]:
lines = []
with open('raw_table.txt', 'r') as file:
    raw_lines = file.readlines()
    for raw_line in raw_lines:
        lines.append([
            item
            for item in raw_line.split('\t')
            if item != '[info]'
            if item != ''
        ])

In [None]:
lines = [
    '|'.join(item).lower()
    for item in lines
]

In [None]:
with open('almost.txt', 'w') as file:
    file.writelines(lines)

#### make csv of person(s) | sex | year(s) ...txt

In [None]:
lines = []
with open('person(s) | sex | year(s) | dream_count.txt', 'r') as f_in:
    lines.append('person(s)|sex|year(s)|entry_count\n')
    for line in f_in.readlines():
        lines.append(line)

lines = [
    line.replace(',', '')
    for line in lines
]

lines = [
    line.replace('|', ',')
    for line in lines
]

with open('some_diary_details.csv', 'w') as f_out:
    f_out.write(''.join(lines))

Ended up doing a few manual edits afterwards (there was an extra ',' on a few lines).
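
One way to avoid stray commas like these is to let the standard `csv` module do the quoting instead of replacing characters by hand. The sketch below is an alternative, not what was run: the helper name `pipe_lines_to_csv` and the sample row are made up, and only the header matches the cell above.

```python
import csv
import io


def pipe_lines_to_csv(lines):
    """Convert pipe-delimited lines to CSV text, quoting fields that contain commas."""
    reader = csv.reader(lines, delimiter='|')
    out = io.StringIO()
    writer = csv.writer(out, lineterminator='\n')
    writer.writerows(reader)
    return out.getvalue()


# The header is taken from the cell above; the data row is made up.
sample_lines = [
    'person(s)|sex|year(s)|entry_count\n',
    'example person, age 20|f|2015|12\n',
]
csv_text = pipe_lines_to_csv(sample_lines)

# Reading the result back gives the original fields, comma included.
round_trip = list(csv.reader(io.StringIO(csv_text)))
```

With this approach the comma inside a name survives as part of a quoted field, so it does not need to be stripped before writing.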

#### Now the main content

In [None]:
for i in range(1, 95):
    with open(f'raw_diary_files/{i}.txt', 'r') as file:
        lines = file.readlines()
        text = ''.join(lines[3:-1])
        # Splitting the text by the regex pattern
        parts = re.split(r'\(\d+ words\)\n', text)

        entry_headers = []
        entry_content = []
        for part in parts:
            entry_lines = part.split('\n')
            entry_headers.append(entry_lines[0])
            entry_content.append(''.join(entry_lines[2:]))

        combined_lines = [
            f'{head}|{content}'
            for head, content in zip(entry_headers, entry_content)
        ]

        combined_lines.insert(0, 'raw_number|content')

        with open(f'almost_clean/diary{"0" + str(i) if i < 10 else str(i)}.csv', 'w') as f:
            f.write('\n'.join(combined_lines))
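
The splitting idea in the cell above can be tried on a small string. The layout of `sample_text` is an assumption inferred from that cell (a header line, a blank line, then the dream text ending in a `(N words)` marker); the real files may differ, and `split_entries` is a hypothetical helper.

```python
import re

# Each entry is assumed to end with a marker such as "(5 words)" and a newline.
WORD_COUNT = re.compile(r'\(\d+ words\)\n')


def split_entries(text):
    """Return (header, content) pairs from raw diary text."""
    entries = []
    for part in WORD_COUNT.split(text):
        part = part.strip()
        if not part:
            continue  # skip the empty piece left after the final marker
        header, _, body = part.partition('\n')
        # collapse line breaks and repeated spaces inside the dream text
        entries.append((header.strip(), ' '.join(body.split())))
    return entries


# Made-up text in the assumed layout.
sample_text = (
    '#1 (01/01/2001)\n\nI walked through a forest. (5 words)\n'
    '#2 (01/02/2001)\n\nI was flying. (3 words)\n'
)
entries = split_entries(sample_text)
```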

---

## Person Details Notebook

In [None]:
with open('./raw/raw_descriptions.txt', 'r') as f:
    content = f.read()

content = content.split('[top]\n')[1:]
content[:5]

In [None]:
persons = [
    person.split('\n')[0].lower().replace(',', '')
    for person in content
]

persons[:5]

In [None]:
# why 90 ???
len(persons)

In [None]:
descriptions = [
    ''.join(description.split('\n')[1:])
    for description in content
]

descriptions[:5]

In [None]:
# why 90 ???
len(descriptions)

In [None]:
missing_people = [
    'izzy age 18-21',
    'izzy age 22-25',
    'sally: a forester',
    'van: a video gamer'
]
ref_people = [
    'izzy age 17',
    'izzy age 18-21',
    'robert bosnak: a dream analyst',
    'ucsc women 1996'
]

In [None]:
for i, ref_person in enumerate(ref_people):
    insert_point = persons.index(ref_person) + 1
    persons.insert(insert_point, missing_people[i])
    descriptions.insert(insert_point, 'na')

In [None]:
import pandas as pd

In [None]:
temp = pd.DataFrame({
    'person': persons,
    'description': descriptions
})
temp.shape

In [None]:
temp.sample(5)

In [None]:
df = pd.read_csv('person_details.csv')
df = df.merge(temp, on='person', how='inner')
df.sample(5)

In [None]:
df = df[['id', 'diary_ref', 'person', 'description', 'sex', 'sex_code', 'year', 'entry_count']]
df.sample(5)

In [None]:
df.to_csv('person_details.csv')

#### Explorations as to why there were only 90 people/descriptions, but there should be 94 entries.

In [None]:
with open('./raw/raw_descriptions.txt', 'r') as f:
    test = f.read()

len(test.split('[top]'))

It appears these people do not have descriptions (based on the missing [info] link at https://dreambank.net/grid.cgi):

    Izzy, age 18-21
    Izzy, age 22-25
    Sally: a forester
    Van: a video gamer



Remember, the persons list has been lowercased.

In [None]:
# find the above entries from the persons list -> get the index -> -1 from them -> insert after this point

In [None]:
missing_lst = [
    'izzy age 18-21',
    'izzy age 22-25',
    'sally: a forester',
    'van: a video gamer'
]

In [None]:
indices = []
for i, person in enumerate(persons):
    if person in missing_lst:
        indices.append(i)

Obviously they're not in there...

check person_details.csv

In [None]:
import pandas as pd

In [None]:
temp = pd.read_csv('person_details.csv')
temp.head(5)

In [None]:
indices = [
    (i, person)
    for i, person in enumerate(temp['person'])
    if person in missing_lst
]
indices

In [None]:
temp[ temp['person'].str.contains('izzy') ]

There it is! I forgot to remove the comma (`,`).

Kinda dumb...
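
A small sketch of the fix: pass both sides of the comparison through the same normalization (lowercase, commas removed) before matching. The helper name `normalize_name` is hypothetical; the four names are the ones listed earlier.

```python
def normalize_name(name):
    """Lowercase a name and drop commas, as was done for the persons list."""
    return name.lower().replace(',', '')


grid_names = [
    'Izzy, age 18-21',
    'Izzy, age 22-25',
    'Sally: a forester',
    'Van: a video gamer',
]
normalized = [normalize_name(name) for name in grid_names]
```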

In [None]:
missing_culprets_indice = [
    (40, 'izzy age 18-21'),
    (41, 'izzy age 22-25'),
    (79, 'sally: a forester'),
    (87, 'van: a video gamer'),
]

#### Phase 2 of Cleaning

initial data documentation


person_details.csv

__id__ : is the id of the person.

__diary_ref__ : a reference to the person's diary file.

__person__ : the person's name, and/or a short description of them, or even their association with someone else...

- [ ] maybe figure out a way to extract only the name. NOTE: for some entries there appear to be associations to other people, for example "johns wife", as well as general descriptions such as "8th grader". These raise the complexity and will require extensive, if not manual, extraction.
- [ ] try to integrate the lengthier description of the person

__sex__ : female or male

__sex_code__ : female = 1, male = 0

- [ ] do this transformation

__year__ : the date or range of dates during which the person recorded their dreams.

- [ ] maybe split into start_date and end_date. This might resolve the instances of 1995-1996, 2018, and mid 1990s, although with the range of possible formats present, this too will take time.
* below is every unique format
```
?
1990s
2007-
2015
2016-2017
1940s-1950s & 1990s
mid-1990s
late 1990s
1940s-1950s
```
__entry_count__ : the number of entries that person recorded.
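
As a starting point for the start/end split of `year`, the sketch below maps each of the formats listed above to a coarse `(start, end)` pair. It was not part of the original cleaning. The function name `year_bounds` and its rules are assumptions: a decade such as `1990s` is treated as 1990 to 1999, qualifiers like `mid` and `late` are ignored, an open range such as `2007-` gets no end year, and a gap such as `1940s-1950s & 1990s` is collapsed into one span.

```python
import re


def year_bounds(text):
    """Return (start, end) years for a year string; None where unknown."""
    # each match is (four-digit year, 's' if written as a decade else '')
    matches = re.findall(r'(\d{4})(s?)', text)
    if not matches:
        return (None, None)  # for example '?'
    first_year, _ = matches[0]
    last_year, last_is_decade = matches[-1]
    start = int(first_year)
    if text.rstrip().endswith('-'):
        end = None  # open-ended range such as '2007-'
    else:
        end = int(last_year) + (9 if last_is_decade else 0)
    return (start, end)


bounds = {
    text: year_bounds(text)
    for text in ['?', '1990s', '2007-', '2015', '2016-2017',
                 '1940s-1950s & 1990s', 'mid-1990s', 'late 1990s', '1940s-1950s']
}
```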

In [None]:
df = pd.read_csv('almost_clean_diary_details.csv')
df.sample(5)

In [None]:
df['diary_ref'] = [
    f'diary{"0" + str(i) if i < 10 else str(i)}'
    for i in range(1, df.shape[0] + 1)
]
df.sample(5)

In [None]:
df['id'] = [
    i
    for i in range(1, df.shape[0] + 1)
]
df.sample(5)

In [None]:
df = df.set_index('id')
df.sample(5)

In [None]:
df = df.rename(columns={
    'person(s)': 'person',
    'year(s)': 'year',
})
df.sample(5)

In [None]:
df = df[['diary_ref', 'person', 'sex', 'year', 'entry_count']]
df.sample(5)

In [None]:
df['sex_code'] = df['sex'].apply(lambda x: 1 if x == 'female' else 0)
df.sample(5)

In [None]:
df = df[['diary_ref', 'person', 'sex', 'sex_code', 'year', 'entry_count']]
df.sample(5)

In [None]:
df['diary_ref'] = df['diary_ref'].apply(lambda x: f'{x}.csv')
df.sample(5)

In [None]:
df.to_csv('person_details.csv')

---

## Person Diary Notebook

The files for this stage became corrupt and are lost now :( not sure why, or how...

---

## VADER Enrichment Notebook

In [None]:
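# Assumed import path (vaderSentiment package); nltk.sentiment.vader provides the same class.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import numpy as np
import pandas as pd

analyzer = SentimentIntensityAnalyzer()
# analyzer.polarity_scores?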

In [None]:
diary_file_names = [
    f"diary{'0' + str(i) if i < 10 else str(i)}.csv"
    for i in range(1, 95)
]
diary_file_names[:5]

For some unknown reason, `diary32.csv`, `diary33.csv`, and `diary41.csv` could not be read in fully. As a result, a few days (out of several thousand) are dropped. This should not affect the overall evaluation. If time permits, I will investigate further.
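
One possible way to investigate is to look for rows whose field count differs from the header, since `on_bad_lines='skip'` drops lines with too many fields. The sketch below is a guess at a diagnostic, not a confirmed cause: `find_bad_lines` is a hypothetical helper, and the sample text (with a stray `|` inside one entry) is made up.

```python
import csv
import io


def find_bad_lines(text, sep='|'):
    """Return (line number, field count) for rows that do not match the header."""
    reader = csv.reader(io.StringIO(text), delimiter=sep)
    header = next(reader)
    bad = []
    for row in reader:
        if row and len(row) != len(header):
            bad.append((reader.line_num, len(row)))
    return bad


# Made-up file content; the third line contains an extra separator.
sample_file = (
    'raw_number|content\n'
    '#1|a fine entry\n'
    '#2|text with a stray | inside\n'
    '#3|another fine entry\n'
)
bad_lines = find_bad_lines(sample_file)
```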

In [None]:
# drop the last row of each file because it contains only NaNs
diary_dfs = [
    pd.read_csv(f'diaries/{diary_file_name}', sep='|', on_bad_lines='skip')[:-1]
    for diary_file_name in diary_file_names
]

In [None]:
for df in diary_dfs:
    # one dict per entry with the keys 'neg', 'neu', 'pos', and 'compound'
    scores = [
        analyzer.polarity_scores(text)
        for text in df['content'].values
    ]

    df['negative'] = [d['neg'] for d in scores]
    df['neutral'] = [d['neu'] for d in scores]
    df['positive'] = [d['pos'] for d in scores]
    df['compound'] = [d['compound'] for d in scores]
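
The unpacking step can be illustrated without running VADER. In this sketch the score dictionaries are made up and `scores_to_columns` is a hypothetical helper; only the four keys (`neg`, `neu`, `pos`, `compound`) match what the cell above reads from `polarity_scores`.

```python
def scores_to_columns(scores):
    """Turn a list of score dicts into one list per output column."""
    names = {
        'neg': 'negative',
        'neu': 'neutral',
        'pos': 'positive',
        'compound': 'compound',
    }
    return {
        column: [score[key] for score in scores]
        for key, column in names.items()
    }


# Made-up scores standing in for analyzer.polarity_scores(text) results.
example_scores = [
    {'neg': 0.1, 'neu': 0.7, 'pos': 0.2, 'compound': 0.5},
    {'neg': 0.3, 'neu': 0.6, 'pos': 0.1, 'compound': -0.4},
]
columns = scores_to_columns(example_scores)
```

A dictionary of equal-length lists like `columns` is one of the inputs the `pd.DataFrame` constructor accepts, so the four columns can be built in one pass.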

In [None]:
diary_dfs[0]

In [None]:
diary_file_names[0]

In [None]:
for i, df in enumerate(diary_dfs):
    df.to_csv(f'diaries/{diary_file_names[i]}', sep='|')

---

## Summary Notebook

In [None]:
import numpy as np
import pandas as pd

diary_paths = [
    f'diary{"0" + str(i) if i < 10 else str(i)}.csv'
    for i in range(1, 95)
]
diary_paths[:5]

In [None]:
dfs = []

for path in diary_paths:
    dfs.append(pd.read_csv(f'data/diaries/{path}', sep='|'))
    
dfs[:5]

In [None]:
summs = []

for df in dfs:
    summs.append(df.describe())

In [None]:
summs[0][['negative', 'neutral', 'positive', 'compound']].T

In [None]:
summs[0][['negative']].T.loc[['negative']]

In [None]:
neg = pd.DataFrame({
    'count': [],
    'mean': [],
    'std': [],
    'min': [],
    '25%': [],
    '50%': [],
    '75%': [],
    'max': []
})
neu = pd.DataFrame({
    'count': [],
    'mean': [],
    'std': [],
    'min': [],
    '25%': [],
    '50%': [],
    '75%': [],
    'max': []
})
pos = pd.DataFrame({
    'count': [],
    'mean': [],
    'std': [],
    'min': [],
    '25%': [],
    '50%': [],
    '75%': [],
    'max': []
})
comp = pd.DataFrame({
    'count': [],
    'mean': [],
    'std': [],
    'min': [],
    '25%': [],
    '50%': [],
    '75%': [],
    'max': []
})

In [None]:
for i in range(len(summs)):
    temp = summs[i][['negative', 'neutral', 'positive', 'compound']].T
    # .loc appends a row for a new label; .iloc cannot enlarge an empty frame
    neg.loc[i] = temp.loc['negative']
    neu.loc[i] = temp.loc['neutral']
    pos.loc[i] = temp.loc['positive']
    comp.loc[i] = temp.loc['compound']

In [None]:
neg['diary'] = np.arange(1, 95)
neg = neg.set_index('diary')
neg

In [None]:
neg.to_csv('data/negative_summary.csv')

In [None]:
neu['diary'] = np.arange(1, 95)
neu = neu.set_index('diary')
neu

In [None]:
neu.to_csv('data/neutral_summary.csv')

In [None]:
pos['diary'] = np.arange(1, 95)
pos = pos.set_index('diary')
pos

In [None]:
pos.to_csv('data/positive_summary.csv')

In [None]:
comp['diary'] = np.arange(1, 95)
comp = comp.set_index('diary')
comp

In [None]:
comp.to_csv('data/compound_summary.csv')

---