# HW3: Collecting, Exploring and Cleaning Data

## Scraping

1. Consider the following HTML snippet:

```
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">Here is an image of Elsie: <a href="http://example.com/elsie"><img src = "elsie.png"></a></p>
<p class="story">And here is an image of Lacie: <a href="http://example.com/lacie"><img src = "lacie.png"></a></p>
"""
```

We would like to get a list of the three little sisters' names. Fill in the missing line of the following function:

```
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')

def get_sisters_names(soup):
    # missing line 1 #
    return [sister.get_text() for sister in sisters]
```

a. `sisters = soup.find_all('sister')`

b. `sisters = soup.find_all('a')`

c. `sisters = soup.find_all('a', {'class': 'sister'})`

d. `sisters = soup.find_all(class = 'sister')`

<br>

2. Now we would like to get only the raw text of the Dormouse story, not including the title. Fill in the missing lines of the following function:

```
def get_story(soup):
    # missing line 1 #
    story = ''
    for p in paragraphs:
        # missing line 2 #
        if class_name == 'story':
            story += p.get_text() + ' '
    return ' '.join(story.split())
```

a. line 1: `paragraphs = soup.find_all('p')`; line 2: `class_name = p['class'][0]`

b. line 1: `paragraphs = soup.find_all('p', {'class': 'story'})`; line 2: `class_name = p['class']`

c. line 1: `paragraphs = soup.find_all('p', {'class': 'story'})`; line 2: `class_name = p.find('class')[0]`

d. line 1: `paragraphs = soup.find_all('story')`; line 2: `class_name = p.attr['class']`

<br>

3. What will be the result of calling the following function on our `soup` object (i.e. `does_something(soup)`)?

```
def does_something(soup):
    return [e['src'].split('.')[0] for e in soup.find_all('img')]
```

a. `['elsie.png', 'lacie.png']`

b. `['elsie', 'lacie']`

c. `{'src': ['elsie', 'lacie']}`

d. The function will throw an exception on our `soup` object

<br>

## Summary Statistics

4. `x` is a 1D numerical vector of length 100 and variance greater than 0. Which of the following statements is incorrect?

a. 50% of the distribution of `x` lies within the IQR of `x` (i.e. between the 25th and 75th percentiles)

b. The mean of `x` is greater than or equal to the median of `x`

c. The standard deviation of `x` can be equal to its variance

d. The kurtosis of `x`, *as defined in class*, is greater than 0
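
The quantities named in the options above can be computed with NumPy, which may help when experimenting with different vectors. The sketch below is only an illustration: `summarize` is a hypothetical helper name, the variance uses NumPy's default `ddof=0`, and because the definition of kurtosis used in class is not restated here, both the raw fourth standardized moment and the "excess" version (raw minus 3) are returned.

```python
import numpy as np

def summarize(x):
    """Return the summary statistics named in question 4 (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    q25, q75 = np.percentile(x, [25, 75])
    var = np.var(x)  # population variance (ddof=0), an assumption
    centered = x - np.mean(x)
    # Fourth standardized moment; it equals 3 for a normal distribution.
    kurt = np.mean(centered ** 4) / var ** 2
    return {
        'mean': np.mean(x),
        'median': np.median(x),
        'iqr': q75 - q25,
        'var': var,
        'std': np.sqrt(var),
        'kurtosis': kurt,
        'excess_kurtosis': kurt - 3,
    }

stats = summarize([1, 2, 3, 4, 5])
print(stats)
```

Calling `summarize` on vectors of your own (skewed, symmetric, rescaled) is a quick way to probe each statement.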

<br>

5. Consider the following function:

```
def interesting_statistic(x):
    stat1 = np.percentile(x, 75) - np.percentile(x, 25)
    stat2 = np.max(x) - np.min(x)
    return stat1 / stat2
```

For which of the following inputs will this function return `1`? (assume `import numpy as np`)

a. `[1, 1, 1, 1, 1]`

b. `np.random.randn(1000)`

c. `[1]`

d. The function will never return 1

<br>

## Basic Plots

6. Fill in the missing line to produce two boxplots of `y`, side by side, one for each value of `x`:

```
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

sns.set()

df = pd.DataFrame({
    'x': np.concatenate([np.tile('a', 100), np.tile('b', 100)]),
    'y': np.concatenate([np.random.normal(size = 100), np.random.exponential(size = 100)])})

# missing line 1 #
```

a. `sns.boxplot(df.x, df.y)`

b. `sns.boxplot(df['x'], df['y'])`

c. `sns.boxplot('x', 'y', data = df)`

d. All of the above

<br>

## Cleaning Data

7. Which one of the following data tables is "tidy"?

a. 
```
pd.DataFrame({
    'student': np.concatenate([np.tile('John', 5), np.tile('Paul', 5)]),
    'before': np.random.randint(10, size = 10),
    'after': np.random.randint(10, size = 10)
})
```

b. 
```
pd.DataFrame({
    'time': np.concatenate([np.tile('before', 5), np.tile('after', 5)]),
    'John': np.random.randint(10, size = 10),
    'Paul': np.random.randint(10, size = 10)
})
```

c.
```
pd.DataFrame({
    'student': np.concatenate([np.tile('John', 10), np.tile('Paul', 10)]),
    'time': np.tile(np.concatenate([np.tile('before', 5), np.tile('after', 5)]), 2),
    'measurement': np.random.randint(10, size = 20)
})
```

d.
```
pd.DataFrame({
    'student': np.concatenate([np.tile('John', 5), np.tile('Paul', 5)]),
    'time': np.concatenate([np.tile('before', 5), np.tile('after', 5)]),
    'measurement': np.random.randint(10, size = 10)
})
```
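
When reasoning about table layouts it can help to see how pandas moves between a "wide" and a "long" form. The sketch below uses a made-up table (the city names, years and the `rainfall` column are all hypothetical) and only shows the mechanics of `melt` and `pivot`; whether a given layout is tidy still depends on what counts as an observation and what counts as a variable.

```python
import pandas as pd

# Hypothetical wide table: one row per city, one column per year.
wide = pd.DataFrame({
    'city': ['Haifa', 'Eilat'],
    '2019': [10, 20],
    '2020': [11, 21],
})

# melt: each year column becomes (year, rainfall) pairs, one pair per row.
long = wide.melt(id_vars='city', var_name='year', value_name='rainfall')

# pivot: the inverse direction, back to one column per year.
back = long.pivot(index='city', columns='year', values='rainfall')

print(long)
print(back)
```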

<br>

8. You've stumbled upon a dataset holding the monthly expenditure of 10,000 random Amazon users, two months per user: a random month in the year before the release of CoolNewFeature (e.g. One-Day Delivery), and a random month in the year after its release. You wish to test the hypothesis that CoolNewFeature has increased the average monthly expenditure overall. We'll ignore for now the question of control (do you see the problem?). You suspect the distribution of monthly expenditure is highly asymmetric, let alone "normal", and that it would not lend itself nicely to standard statistical tests, e.g. the T-test. Which of the following is correct?

a. The log transformation may be a proper transformation to use on your data before trying the T-test.

b. Perform the T-test with and without the log transformation. If both tests agree on the result, report this result.

c. With n = 10000 any statistical test is robust against outliers.

d. The median may be a better statistic to estimate the central location of the expenditure distribution, but only after ignoring the many possible zeros in such data.
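
One way to explore these options is to simulate data with the properties described in the question. The sketch below is an assumption-laden illustration, not the real dataset: the 30% share of zero-expenditure months, the log-normal shape and its parameters are all made up, `skewness` is a hypothetical helper, and `np.log1p` is used in place of a plain log only because it is defined at zero.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Simulated monthly expenditure (an assumption, not real data):
# about 30% of the months are zero, the rest are log-normally distributed.
n = 10000
spend = rng.lognormal(mean=3.0, sigma=1.0, size=n)
spend[rng.random(n) < 0.3] = 0.0

def skewness(v):
    """Third standardized moment: 0 for a symmetric distribution."""
    centered = v - v.mean()
    return np.mean(centered ** 3) / v.std() ** 3

# log1p(v) = log(1 + v) is defined at zero, unlike log(v).
log_spend = np.log1p(spend)

print('mean:', spend.mean(), 'median:', np.median(spend))
print('skewness raw:', skewness(spend), 'after log1p:', skewness(log_spend))
```

Changing the share of zeros or the `sigma` parameter shows how sensitive the mean, the median and the skewness are to the shape of the distribution.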