# General Instructions
1 - Start by downloading this Jupyter notebook to your local machine.

2 - Open a tab in your browser and type https://colab.research.google.com/

3 - This will open a small window. Choose the last option on the upper menu, "Upload", then select the Jupyter notebook you saved in step 1.

4 - You can start working on your assignment by answering the questions in the corresponding cells.

5 - If you have any questions, please reach out to your instructor and TAs.


# Statistics - Variables Assignment
# Introduction to Variables Location-Based Assignment

This is a location-based assignment that will require you to interact with the city around you in a new way. Simply put, the objective is to measure a variable. You will identify a measurable variable in the city and then create an estimate using the Fermi estimation technique. Next, you will complete the data collection, calculate descriptive statistics on the data, and create relevant data visualizations. You will also have a chance to apply your knowledge of probability and simulation to solve a problem.
This is an individual assignment. Everything you submit should be your own words and reflect your own understanding of the material.

**NOTES:**

Anything marked as optional will only be scored if it is completed correctly. You must upload two files:
* **Primary Resource**: A PDF of your entire assignment. Run all cells before converting the notebook to a PDF, and double-check that the PDF is complete with all sections visible. Email attachments will not be accepted. If you’re having difficulty converting your notebook to a PDF, try the tips available [here](https://docs.google.com/document/d/15dX89FOEoVEPuUNhY3PDloJ2qmQ63RQ_5l51zMEAlYQ/edit?usp=sharing).

* **Secondary Resource**: A zipped folder containing the .ipynb file and your original photo files.


## PART 1: VARIABLE SELECTION [#variables]

Select a neighborhood within a 10-minute walk of where you live. Visit this neighborhood and spend at least 30 minutes exploring it to find your variable.

Important notes:
* The variable must be something that can be measured at different locations in the city. You need to make at least 10 different measurements of this variable, one for each location. The locations must be at least 100 meters away from each other.
* You must be able to calculate the mean, median, mode, and standard deviation of the variable.
* Be clear about your choice of locations to make the variable measurements.
* Get creative! Try to choose an interesting and informative variable and make sure to justify why the variable you have chosen is interesting.


**1. Define and operationalize your variable here.** 

Describe how you selected your variable. Specifically identify the type of variable, and whether you will be measuring a total, proportion, or average. Also identify the units it will be measured in and explain in detail how you will measure it. Make sure that your explanation is clear enough that another student would understand how to make the same measurement. Give the addresses of the 10 or more locations where you will conduct your measurements and provide an image that clearly identifies these locations on a map. (<150 words)


**2. Discuss variable relationships.**

* **2.1** (<150 words)
     - A. Describe a scenario in which your variable could be an independent variable. 
     - B. What could be the dependent variable(s)? 
     - C. What are some possible extraneous or confounding variables in this scenario? 

* **2.2** (< 150 words)
    * A. Describe a scenario in which your variable could be a dependent variable. 
    * B. What could be the independent variable(s)? 
    * C. What are some possible extraneous or confounding variables in this scenario? 

## PART 2: ESTIMATION AND MEASUREMENT [#variables]

**Important note:** *if there is any reason to believe that you did not authentically complete the location-based portion of this assignment, this will be referred to the Academic Committee, and you risk receiving zeros in all your grades (as per the course policy in the syllabus). Please follow the instructions here carefully and include the original photo files in the zip folder along with the .ipynb file.*

1. Go to a cafe in the neighborhood of your choice to produce a Fermi estimate of your variable. Use a napkin at the cafe to begin your Fermi estimate. You may not (yet) make any measurements. Your estimate should aim to involve at least 5 steps where you compute intermediate values. You will have to describe each step clearly, show your work, state any assumptions you’re making, and discuss whether your answer seems plausible (but it’s not necessary to do so on the napkin; see step 3 below).
2. Take some photos to document this experience. You must include:
    * A photo just of your “back of the napkin” estimate (it can and should be quite rough at this point). You will properly format the calculation later.
    * A selfie in the cafe in which you constructed your Fermi estimate. Clearly show your face, your Fermi estimate, and some of the interior of the cafe.
    * A selfie outside of the cafe showing your face and the exterior of the cafe, including the name. Bonus points if you are also holding your completed Fermi estimate in the photo.
3. Typeset your full estimation in the Python notebook. Here, be sure to clearly explain all steps, justify all assumptions, and comment on whether the answer seems plausible.
4. It’s time to collect your data! Once again, take some photos to document your experience. Include at least two photos of your variable collection process. At least one photo should include your face and the variable you are counting.
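As a rough illustration of how a typeset Fermi estimate might look in a notebook cell, the sketch below chains five intermediate steps for a hypothetical variable (bicycles parked on one city block). Every number and variable name is a made-up placeholder assumption, not data; replace each one with your own justified guess.

```python
# Hypothetical Fermi estimate: bicycles parked on one city block.
# All inputs below are illustrative assumptions, not measurements.
residents_per_building = 40    # assumed: about 20 apartments x 2 people
buildings_per_block = 10       # assumed size of a typical block
bike_ownership_rate = 0.25     # assumed: 1 in 4 residents owns a bike
share_parked_outside = 0.2     # assumed: most bikes are stored indoors
visitor_bikes_per_block = 5    # assumed: commuters and customers

# Step 1: residents living on the block
residents = residents_per_building * buildings_per_block
# Step 2: bikes owned by those residents
resident_bikes = residents * bike_ownership_rate
# Step 3: resident bikes that are parked on the street
resident_bikes_outside = resident_bikes * share_parked_outside
# Step 4: add bikes brought in by visitors
total_bikes = resident_bikes_outside + visitor_bikes_per_block
# Step 5: plausibility band, allowing a factor of 2 in either direction
low, high = total_bikes / 2, total_bikes * 2

print(f"Estimate: about {total_bikes:.0f} bikes per block "
      f"(plausible range {low:.0f} to {high:.0f})")
```

Keeping each step on its own line makes it easy to state and justify the assumption behind it, and to see which input the final answer is most sensitive to.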

Follow the instructions in this [link](https://docs.google.com/document/d/1OTLUMXG8NWzJkgZ5MtpjHpB45gC1IAzyofaC5_-WwPg/edit?usp=sharing) to upload your pictures to the Jupyter notebook:


In [1]:
#Answer Q3 here

In [2]:
#Upload your images here using the instructions in the link mentioned above.

## PART 3: ANALYSIS

1. Analyze the data in Python [#algorithms]:
    - **1.1** Use any method to import your collected data into Python. You can simply type the data directly into a Python list or NumPy array. Or, you can put the data in a Google Sheet, export it to a .csv file, and import that file into Python. Print your data in the cell below.

    - **1.2** Using Python, calculate the mean, median, mode, range, and standard deviation of your variable. Print these values. If you use a library function, you need to explain how it works with detailed comments. Do not blindly use library functions!

**Note**: Round your final answers to 2 decimal places.
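For the import step in 1.1, one possible approach uses Python's built-in `csv` module. The sketch below assumes a hypothetical export with a header row and columns named `location` and `count`; the file name, column names, and values are all placeholders. It reads from an in-memory string so that it runs as-is.

```python
import csv
import io

# Stand-in for an exported file; column names and values are hypothetical.
csv_text = """location,count
Cafe corner,12
Park entrance,7
Bus stop,9
"""

def load_measurements(file_obj, column="count"):
    """Return the values stored in `column` as a list of floats."""
    reader = csv.DictReader(file_obj)  # uses the header row as keys
    return [float(row[column]) for row in reader]

# With a real export, pass open("my_data.csv", newline="") instead of StringIO.
my_data = load_measurements(io.StringIO(csv_text))
print(my_data)
```

Typing the values directly into a list is just as valid for ten measurements; reading a file mainly helps when the data is also kept in a spreadsheet.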

In [3]:
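import numpy as np          # used by the statistics functions in the cells below
from scipy import stats     # used by mode()

def mean(my_data):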
    """
    calculates the mean of your data
    
    Input:
    my_data: an array of numbers (floats)
    
    Output: a float that represents the mean of your data
    """
    ### BEGIN SOLUTION
    return round(np.mean(my_data),2)
    ### END SOLUTION

In [4]:
def median(my_data):
    """
    calculates the median of your data
    
    Input:
    my_data: an array of numbers (floats)
    
    Output: a float that represents the median of your data
    """
    ### BEGIN SOLUTION
    return round(np.median(my_data),2)
    ### END SOLUTION

In [5]:
def mode(my_data):
    """
    calculates the mode of your data
    
    Input:
    my_data: an array of numbers (floats)
    
    Output: a float that represents the mode of your data
    """
    ### BEGIN SOLUTION
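    # stats.mode returns an array in older SciPy releases and a scalar in newer
    # ones (keepdims=False); np.atleast_1d makes the indexing work for both.
    return round(np.atleast_1d(stats.mode(my_data, axis=None)[0])[0], 2)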
    ### END SOLUTION

In [6]:
def data_range(my_data):
    """
    calculates the range of your data
    
    Input:
    my_data: an array of numbers (floats)
    
    Output: a float that represents the range of your data
    """
    ### BEGIN SOLUTION
    return round(np.ptp(my_data),2)
    ### END SOLUTION

In [7]:
def standard_deviation(my_data):
    """
    calculates the standard deviation of your data
    
    Input:
    my_data: an array of numbers (floats)
    
    Output: a float that represents the standard deviation of your data
    """
    ### BEGIN SOLUTION
    return round(np.std(my_data),2)
    ### END SOLUTION

In [8]:
# Please ignore this cell. This cell is for us to implement the tests 
# to see if your code works properly. 
### BEGIN HIDDEN TESTS
import numpy as np
from scipy import stats
my_data = [1, 2, 3, 4, 5]
assert(float(mean(my_data)) == round(np.mean(my_data),2))
assert(float(median(my_data)) == round(np.median(my_data),2))
# np.atleast_1d keeps this check working whether stats.mode returns an array or a scalar.
assert(float(mode(my_data)) == round(np.atleast_1d(stats.mode(my_data, axis=None)[0])[0], 2))
assert(float(data_range(my_data)) == round(np.ptp(my_data),2))
assert(float(standard_deviation(my_data)) == round(np.std(my_data),2))
### END HIDDEN TESTS
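Because 1.2 asks you to explain any library function you use, it can help to compute each statistic by hand once and compare. The sketch below is one way to do that with the standard library only; the `describe` helper and the sample values are hypothetical. Note that `np.std` divides by n by default (`ddof=0`, the population form), which is what the manual variance below mirrors.

```python
import math
import statistics

def describe(values):
    """Compute basic descriptive statistics step by step."""
    n = len(values)
    ordered = sorted(values)
    mean = sum(values) / n
    mid = n // 2
    # Middle value for odd n, average of the two middle values for even n.
    median = ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2
    # Population variance divides by n; a sample variance would divide by n - 1.
    variance = sum((x - mean) ** 2 for x in values) / n
    return {
        "mean": round(mean, 2),
        "median": round(median, 2),
        "mode": statistics.mode(values),       # most frequent value
        "range": max(values) - min(values),
        "std": round(math.sqrt(variance), 2),
    }

sample = [4, 8, 6, 5, 3, 8, 9, 5, 8, 4]  # hypothetical measurements
summary = describe(sample)
print(summary)
```

If your hand-computed standard deviation differs from a library result, the first thing to check is whether the library divided by n or by n - 1.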

* **1.3** Create a histogram for your data, properly formatting your figure.
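A minimal sketch of a formatted histogram follows, assuming matplotlib is available (it is preinstalled on Colab). The helper names, bin width, axis labels, and sample values are illustrative assumptions; adapt them to your variable and units.

```python
import math

def bin_edges(values, bin_width):
    """Edges that cover the data range in steps of bin_width."""
    start = math.floor(min(values) / bin_width) * bin_width
    stop = math.ceil(max(values) / bin_width) * bin_width
    n_bins = max(1, round((stop - start) / bin_width))
    return [start + i * bin_width for i in range(n_bins + 1)]

def plot_histogram(values, bin_width=1, xlabel="Measured value (units)"):
    """Draw a labelled histogram and return the matplotlib Axes."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    fig, ax = plt.subplots()
    ax.hist(values, bins=bin_edges(values, bin_width), edgecolor="black")
    ax.set_xlabel(xlabel)
    ax.set_ylabel("Number of locations")
    ax.set_title("Distribution of measurements across locations")
    return ax

sample = [4, 8, 6, 5, 3, 8, 9, 5, 8, 4]  # hypothetical measurements
print(bin_edges(sample, 2))
# plot_histogram(sample, bin_width=2)  # uncomment in a notebook to display
```

Choosing the bin edges explicitly, rather than accepting the default, makes it easier to explain why the figure looks the way it does with only ten or so data points.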

**2.** Interpret the descriptive stats: What can you say about the neighborhood based on these values? Is the distribution skewed? Is your visualization in agreement with the descriptive statistics? Explain. [#professionalism, #descriptivestats, #algorithms] (<200 words)
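One quick check on skew is to compare the mean with the median and to compute a skewness coefficient. The sketch below uses the population moment coefficient of skewness on hypothetical data with one large value; the `skewness` helper is an illustration, not a required method.

```python
import statistics

def skewness(values):
    """Moment coefficient of skewness (population form)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return 0.0  # all values identical, so there is no skew
    return sum(((x - mean) / sd) ** 3 for x in values) / len(values)

data = [2, 3, 3, 4, 4, 4, 5, 5, 6, 14]  # hypothetical, one large outlier
mean_value = statistics.fmean(data)
median_value = statistics.median(data)
skew = skewness(data)

print(f"mean={mean_value:.2f}, median={median_value:.2f}, skewness={skew:.2f}")
```

A mean above the median together with a positive coefficient suggests a right-skewed distribution, which should show up as a longer right tail in the histogram. With about ten observations, a single extreme value can drive the result, so treat the sign as a hint rather than a firm conclusion.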

## PART 4: PROBABILITY CONSIDERATIONS [#probability, #algorithms, #dataviz]

**1.** Can the mean of your data be interpreted as the expected value of a random variable? Explain why or why not in detail. (~50 words)
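One way to explore this question numerically: if a location is drawn uniformly at random from those you visited, the measurement at that location is a random variable whose probability mass function is the empirical frequency of each value. The sketch below, using hypothetical measurements, computes E[X] as the sum of x·P(X = x) under that assumption and compares it with the arithmetic mean.

```python
import math
from collections import Counter

measurements = [4, 8, 6, 5, 3, 8, 9, 5, 8, 4]  # hypothetical data

n = len(measurements)
counts = Counter(measurements)
# Empirical probability of each distinct value.
pmf = {value: count / n for value, count in counts.items()}
# Expected value: each value weighted by its probability.
expected_value = sum(value * p for value, p in pmf.items())
sample_mean = sum(measurements) / n

print(expected_value, sample_mean)
```

The two numbers agree by construction. Whether the mean also describes the wider neighborhood depends on how the locations were chosen, which is the part your written answer should address.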

**2.** Suppose something unfortunate happened: you stole too many napkins for your Fermi estimate, so you decided to write all of your variable measurements on separate napkins, one napkin for each location. On your way back to the campus, the wind picked up and blew them all away! Luckily, you managed to collect all of the napkins, but now the data is totally randomly reordered, meaning that you have no idea which napkin corresponds to which location. Suppose that you tried to just guess randomly which napkin goes with which location. In other words, you randomly assign each napkin to a given location.

* What is the probability that you are unlucky, and sadly NONE of the napkins are matched to the correct location (you guessed all of them wrong)? Estimate this probability using a simulation. Be sure to interpret the result appropriately. See hints below.

In [9]:
def unlucky_probability(ordered_napkins):
    """
    calculate the probability that all of your guesses are wrong
    
    input: A list of integers in order, where the first integer corresponds to the first location,
    the second to the second location, etc. (see the hints below)
    
    output: A float rounded to two decimals (e.g. an output of 15.20 means that the probability
    of guessing all of them wrong is 15.20%) 
    
    Here is an example:
    input: [1, 2]. In this input, 1 refers to the first location and 2 refers to the second location.
    After the unfortunate wind blew away your napkins, you collect them and you have to guess which one refers to
    which location. So you might guess -> [2,1] which would be a wrong guess, or -> [1,2] which would be the right guess.
    In this simple example, the output should be close to 50, since there is a 50% chance that you guess wrong.
    
    (Note that you should have at least 10 integers corresponding to 10 locations)
     """
    ### BEGIN SOLUTION
    import copy
    import random
    
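    n_trials = 1000
    counter = 0  # trials in which no napkin landed on its own location
    for j in range(n_trials):
        # Shuffle a copy so the original order stays available for comparison.
        random_napkins = copy.deepcopy(ordered_napkins)
        random.shuffle(random_napkins)
        matches = 0
        for i in range(len(random_napkins)):
            if random_napkins[i] == ordered_napkins[i]:
                matches += 1
        if matches == 0:
            counter += 1

    # Convert the fraction of all-wrong trials into a percentage.
    return round(100 * counter / n_trials, 2)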
    ### END SOLUTION

**3. [Optional]:** What is the expected number of napkins that will be correctly matched to the corresponding location? Estimate this value using a simulation and interpret the result appropriately.

**4. [Optional]:** Determine the probability distribution as a function of the number of correctly matched napkins and create a visualization.

**5. [Optional]:** Interpret the distribution based on your previous results.

**6. [Optional]:** Compute the probability, the expected value, or both analytically (without a simulation).
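For the analytic route, a permutation with no correct matches is called a derangement. The probability that a random permutation of n items is a derangement is the alternating sum of (-1)^k / k! for k from 0 to n, which approaches 1/e as n grows. The sketch below evaluates that sum exactly with fractions; the function name is a hypothetical helper. The expected number of matches follows from linearity of expectation: each of the n napkins is in the right place with probability 1/n.

```python
from fractions import Fraction
from math import factorial

def derangement_probability(n):
    """P(no napkin matches its location) = sum over k of (-1)^k / k!."""
    return sum(Fraction((-1) ** k, factorial(k)) for k in range(n + 1))

n = 10
p_none = derangement_probability(n)
# Linearity of expectation: n napkins, each correct with probability 1/n.
expected_matches = sum(Fraction(1, n) for _ in range(n))

print(f"P(no matches) = {float(p_none):.6f}")
print(f"Expected matches = {expected_matches}")
```

Comparing these exact values with the simulated ones is a useful sanity check on the simulation code.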

### **Hints:**
* To simplify the problem, you can disregard your actual variable data if you wish, and simply make a new list in Python consisting of the numbers 0 through 9: napkins = [0,1,2,3,4,5,6,7,8,9]. Pretend that this is your stack of napkins with the variable measurements in the correct order. Notice that this data satisfies napkins[i] == i, for all values of i from 0 to 9. Think of the index i as the location label.
* A random permutation of this list can be created with the following code: rand_napkins = np.random.choice(napkins,10,replace=False). You should be able to explain how this function works and why it is relevant for the problem.
* You want to check whether rand_napkins[i] == i, for each value of i from 0 to 9.
* You’ll need to use a loop to create many random lists and repeat the checking procedure, keeping track of the number of matches each time.
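The hints above can be sketched as a single simulation that records how many napkins match in each trial. This version swaps in `random.sample` from the standard library for `np.random.choice(..., replace=False)`; both draw a random permutation of the list. The function name, trial count, and seed are arbitrary illustrative choices.

```python
import random
from collections import Counter

def match_distribution(n_napkins=10, n_trials=20_000, seed=0):
    """Estimate P(exactly k napkins are matched correctly) for each k."""
    rng = random.Random(seed)          # seeded so results are reproducible
    napkins = list(range(n_napkins))   # napkins[i] == i is the correct order
    counts = Counter()
    for _ in range(n_trials):
        shuffled = rng.sample(napkins, n_napkins)  # random permutation
        matches = sum(1 for i, value in enumerate(shuffled) if value == i)
        counts[matches] += 1
    return {k: counts[k] / n_trials for k in sorted(counts)}

dist = match_distribution()
p_none = dist.get(0, 0.0)
expected_matches = sum(k * p for k, p in dist.items())

print(f"P(no matches) is roughly {100 * p_none:.2f}%")
print(f"Expected number of matches is roughly {expected_matches:.2f}")
```

The returned dictionary can be passed straight to a bar chart for the optional visualization. Increasing `n_trials` narrows the gap between the simulated and analytic values.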

## PART 5: REFLECTION [#probability, #variables]

Reflect on your application of the LOs in this assignment. How do the connections in the city map to the connections between the different LOs? Also reflect on how your prediction and estimation from parts 1 and 2 compare to the results. (<200 words)

### PYTHON TIPS
Part of the purpose of this assignment is to expose you to and give you practice in using tools for working with data in Python. The following may be useful.

* Participating actively in the weekly structured study sessions will help prepare you to complete the Python portion of this assignment. The weekly session material can be found here.
* Your peer tutors and professors are here to help! Make use of office hours for assistance.
* For other resources to learn NumPy, you can read or watch any of the tutorials found online, such as https://docs.scipy.org/doc/numpy/user/quickstart.html. You do not need to learn everything about this library, just the basics of arrays and reading their entries.
* To learn to plot the necessary figures, read as much of http://matplotlib.org/users/beginner.html as is necessary to perform the required tasks. Additionally, there is an enormous amount of freely available instructional material, with examples, that can be found online.
* As a best practice, your graphics in Jupyter notebooks should be ‘inline.’ If your version does not do this automatically, include %matplotlib inline at the top of your script.
* Reminder: no matter what, your code needs comments. Read this resource about the importance of comments and this one for further guidance.