# Fundamentals of Social Data Science 

Week 1. Day 1. Exercises from Chapter 1 of FSStDS 

Within your week 1 study pod, discuss the following questions. Please submit an individual assignment by 12:30pm tomorrow, Tuesday, October 11, 2022, on Canvas. 

These will not be marked and are solely for recordkeeping and review upon request. They will, however, be discussed in the Tuesday tutorial and briefing.

# Exercise 1. Data as operationalisation 

In the book we talk about data as being measurements from the world. The measurements represent phenomena but are not the phenomena themselves. To this end, we think of social data science as a 'science of the operationalisation of the social world'. Below are several key concepts about the world that we can operationalise in a variety of ways. For each of the concepts think of: 
1. A way that you can measure this concept in a survey. Is there a scale that people have used? Has this been mentioned in an academic paper? 
2. A question that you can ask someone in order to get a more in-depth response about the topic.
3. A set of data from a social media platform that might strongly correlate with the measure, either directly or indirectly. Do you think that you could collect this data just by browsing, or would you need to access it in a more structured form?


| Topic                      | Survey Q. | Interview Q. | Trace data |
|----------------------------|-----------|--------------|------------|
| 1. Number of close friends |           |              |            |
| 2. Political affiliation   |           |              |            |
| 3. Preferred social media  |           |              |            |



## Answer 1.

Please fill in the table, either in the markdown or below just in text. 

As an optional challenge, think about looking online for sources where people have done any of these. Can you find at least one academic paper for each of the nine cells? 

__Answer below here:__ 
1. Number of close friends (as a survey Q, interview Q, and as trace data):
    
    a. Survey Q: ["How many friends do you have?"](https://www.researchgate.net/profile/Kiyoko-Kamibeppu/publication/7802997_Impact_of_the_Mobile_Phone_on_Junior_High-School_Students%27_Friendships_in_the_Tokyo_Metropolitan_Area/links/55ba16de08aec0e5f43e7a4a/Impact-of-the-Mobile-Phone-on-Junior-High-School-Students-Friendships-in-the-Tokyo-Metropolitan-Area.pdf)

    b. Interview Q: ["List up to 5 of your best friends"](https://pure.uvt.nl/ws/portalfiles/portal/518378/SN_Kalmijn.pdf)

    c. Trace Data: ["Uses 72 measures from Facebook to predict closeness of friendships amongst random sample from 'Facebook Friends'"](https://redirect.cs.umbc.edu/courses/graduate/CMSC691/spring22/pdf/1518701.1518736.pdf)

2. Political affiliation (as a survey Q, interview Q, and as trace data):

    a. Survey Q: ["Liberal to Conservative on a numeric scale"](https://digitalcommons.chapman.edu/cgi/viewcontent.cgi?article=1001&context=sociology_articles)

    b. Interview Q: ["Discussing or writing essays on personal political status and views"](https://digitalcommons.unomaha.edu/cgi/viewcontent.cgi?article=1029&context=slceciviceng)

    c. Trace Data: ["Supervised learning classification in social networks"](https://www.dhi.ac.uk/san/waysofbeing/data/economy-crone-colleoni-2014.pdf)

3. Preferred social media (as a survey Q, interview Q, and as trace data):

    a. Survey Q: ["Social Media X is good for..., Social Media Y is good for ..."](https://core.ac.uk/download/pdf/268003906.pdf), ["Rank platform preference"](https://www.researchgate.net/profile/Kristalyn-Gallagher/publication/327675994_Are_You_on_the_Right_Platform_A_Conjoint_Analysis_of_Social_Media_Preferences_in_Aesthetic_Surgery_Patients/links/5dc0c99c299bf1a47b154775/Are-You-on-the-Right-Platform-A-Conjoint-Analysis-of-Social-Media-Preferences-in-Aesthetic-Surgery-Patients.pdf)

    b. Interview Q: "What platforms do you prefer and what do you use each one for?"

    c. Trace Data: ["Multi Platform LDA (latent dirichlet allocation, topic-modeling method)"](https://ink.library.smu.edu.sg/cgi/viewcontent.cgi?article=4653&context=sis_research)

In each of these examples, the survey and interview questions would be relatively straightforward to interpret but would carry upfront administrative costs for collecting survey results and conducting interviews. For all three topics, the trace data requires a rather sophisticated set of measures or analyses, which makes interpretation far more difficult than for the interview and survey counterparts. I think surveys and interviews are more direct routes to the human-value-centric phenomena we're looking to understand, whereas trace data encodes social information in a technical, operational context that we then have to reinterpret. Each switch of context loses some information.

__Answer above here__

# Exercise 2. FREE coding 

Take the following function and try to find a way to refactor it so that it can:
1. [Be functioning] Give the right output with the right input (find the bug), 
2. [Be robust] Give a missing value with the wrong input (what if we sent it a number?), 
3. [Be elegant] Have less repetition (can we simplify the `elif` statements?), 
4. [Be efficient] Use a more efficient algorithm (did the last line take care of all the inefficiencies?)

*Challenge*: The function takes a string and returns only vowels. That means `"y"` is an edge case. What will you do with it? The most sophisticated NLP packages might know which `"y"` is a vowel. Do we need to go that far? Can we warn people or somehow use a parameter for an option to include or exclude `"y"`?

In [94]:
%%time

def return_only_vowels(text):
    newtext = ""
    for letter in text: 
        if letter == 'a':
            newtext += letter
        elif letter == 'e':
            newtext += letter
        elif letter == 'i':
            newtext += letter
        elif letter == 'o':
            newtext += letter
        elif letter == 'u':
            newtext += letter
        
    return text

text1 = "The quick brown fox jumped over the lazy dog"
text2 = "It was the best of times, it was the blurst of times."
text3 = "A stitch in time saves 9"

result_list = []
for text in [text1, text2, text3]:
    result_list.append(return_only_vowels(text))

for result in result_list: 
    print(result)

The quick brown fox jumped over the lazy dog
It was the best of times, it was the blurst of times.
A stitch in time saves 9
CPU times: user 30 µs, sys: 2 µs, total: 32 µs
Wall time: 33.9 µs


## Answer 2. 

Below try to rewrite the function and the inputs so that it runs faster, returns the right text, handles bad input gracefully, and reads a little better: 

In [95]:
import re

In [96]:
%timeit -n 1000 return_only_vowels(text1)

2.96 µs ± 186 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [97]:
def return_only_vowels_regex(text, includeY = False):

    # Pattern matching every character that is NOT a vowel
    non_vowels = '[^aeiouyAEIOUY]' if includeY else '[^aeiouAEIOU]'

    if isinstance(text, str):
        return re.sub(non_vowels, '', text)
    return None  # missing value for non-string input

assert return_only_vowels_regex(text1) == 'euiooueoeeao'
assert return_only_vowels_regex(text2) == 'Iaeeoieiaeuoie'
assert return_only_vowels_regex(text3) == 'Aiiieae'

assert return_only_vowels_regex(text1, includeY = True) == 'euiooueoeeayo'
assert return_only_vowels_regex(text2, includeY = True) == 'Iaeeoieiaeuoie'
assert return_only_vowels_regex(text3, includeY = True) == 'Aiiieae'

assert return_only_vowels_regex(10) is None
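
The challenge also asks whether people could be warned about `"y"`. Below is a minimal sketch of one option, using the standard library `warnings` module. The function name `return_only_vowels_warn` and the parameter name `include_y` are assumptions made for this illustration, and the sketch does not try to decide which `"y"` is actually a vowel.

```python
import re
import warnings


def return_only_vowels_warn(text, include_y=None):
    """Hypothetical variant: warn when "y" appears and no choice was made."""
    if not isinstance(text, str):
        return None  # missing value for non-string input

    # Only warn when the caller has not made an explicit choice about "y".
    if include_y is None:
        if re.search('[yY]', text):
            warnings.warn('Text contains "y"; it is excluded by default. '
                          'Pass include_y=True to keep it.')
        include_y = False

    # Pattern matching every character that is NOT a vowel
    non_vowels = '[^aeiouyAEIOUY]' if include_y else '[^aeiouAEIOU]'
    return re.sub(non_vowels, '', text)


print(return_only_vowels_warn("The lazy dog", include_y=True))
```

Using `None` as the default lets the function tell "no choice made" apart from an explicit `include_y=False`, so callers who have already decided are not warned.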


In [98]:
%timeit -n 1000 return_only_vowels_regex(text1)

2.9 µs ± 148 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [99]:
def return_only_vowels_list(text, includeY = False):

    vowels = 'aeiouyAEIOUY' if includeY else 'aeiouAEIOU'

    if isinstance(text, str):
        # Keep only the letters that appear in the vowel string
        return ''.join(letter for letter in text if letter in vowels)
    return None  # missing value for non-string input

In [100]:
%timeit -n 1000 return_only_vowels_list(text1)

1.93 µs ± 87.2 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


My first approach was to use a regex to replace all non-vowels with an empty string, but that took approximately the same time as the original function. My group suggested iterating through the text and keeping only the letters found in a string of vowels, which ended up being faster and is probably easier to understand. 
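
The `%timeit` magic only works inside IPython or Jupyter. As a rough sketch of how the same comparison might be run in a plain script, the standard library `timeit` module can be used. The function names `vowels_regex` and `vowels_join` are hypothetical stand-ins for the two approaches above, and input validation is left out for brevity.

```python
import re
import timeit

VOWELS = 'aeiouAEIOU'
NON_VOWEL_PATTERN = re.compile('[^aeiouAEIOU]')  # compiled once, reused per call


def vowels_regex(text):
    # Remove every character that is not a vowel.
    return NON_VOWEL_PATTERN.sub('', text)


def vowels_join(text):
    # Keep only the letters that appear in the vowel string.
    return ''.join(letter for letter in text if letter in VOWELS)


text1 = "The quick brown fox jumped over the lazy dog"

# Both approaches should agree before their speed is compared.
assert vowels_regex(text1) == vowels_join(text1) == 'euiooueoeeao'

for func in (vowels_regex, vowels_join):
    # Total seconds for 1,000 calls; results vary by machine and Python version.
    seconds = timeit.timeit(lambda: func(text1), number=1000)
    print(f"{func.__name__}: {seconds / 1000 * 1e6:.2f} microseconds per call")
```

Checking that the two functions return the same string first matters here: a timing comparison says little if one of the candidates is quietly returning the wrong result.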

# Exercise 3. Pseudocode

Pseudocode a recipe for making a pizza! It should have a dough base, a sauce, and two toppings. Don't worry about making it more complicated, even though a great pizza can be an art. 

Some questions: 
1. Will you ask the user for what toppings they want?
2. What assumptions will you make about the ingredients? That is, will you assume they are already cooked or otherwise prepared? 
3. What assumptions will you make about the pizza oven? 

## Answer 3. 

Below write the pseudocode. Share it with a friend of yours and ask: do you think they would make the same pizza as you with these instructions? What might vary? 

__Answer below here__: 

    class Pizza:

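        def __init__(self):

            self.dough = None     # placeholder for the dough base
            self.roundness = 0    # how round the dough is so far

        def toss(self):
            self.roundness += 1

        def stretchDough(self):

            # keep tossing until the dough is round enough
            while self.roundness < 10:
                self.toss()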

        def addGarlicButter(self):
            garlicButter = getGarlicButter()
            slather(self.dough, garlicButter)
        
        def addSauce(self, sauce):
            slather(self.dough, sauce)

        def addToppings(self, toppings):

            for topping in toppings:
                cover(self.dough, topping)

        def cook(self):
            pizzaOnPaddle = onPaddle(self.dough)
            inFire(pizzaOnPaddle)
            time.sleep(900)  # bake; 900 seconds is 15 minutes
            outFire(pizzaOnPaddle)

        def serve(self):

            chefsKiss()
            sayPerfetto()
            giveToCustomer()


    def make_me_pizza(dough = "Plain", sauce = "Marinara", toppings = ["Cheese", "Pepperoni"]):

        pizza = Pizza()
        pizza.stretchDough()

        if dough == "Garlic": pizza.addGarlicButter()

        pizza.addSauce(sauce)
        pizza.addToppings(toppings)
        pizza.cook()
        pizza.serve()


__Answer above here__
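
One way to probe whether a friend would make the same pizza is to turn the pseudocode into something that runs and compare the recorded steps. The sketch below is a simplification under stated assumptions: the snake_case method names, the `steps` log, and the `target_roundness` and `seconds` parameters are hypothetical, and every physical action is only logged as a string rather than performed.

```python
class Pizza:
    """Hypothetical runnable version: each step is logged, not performed."""

    def __init__(self, dough="Plain"):
        self.dough = dough
        self.roundness = 0
        self.steps = []  # record of everything done to this pizza

    def stretch_dough(self, target_roundness=10):
        # Toss until the dough is round enough.
        while self.roundness < target_roundness:
            self.roundness += 1
        self.steps.append(f"stretch {self.dough} dough with {self.roundness} tosses")

    def add_sauce(self, sauce):
        self.steps.append(f"add {sauce} sauce")

    def add_toppings(self, toppings):
        for topping in toppings:
            self.steps.append(f"add {topping}")

    def cook(self, seconds=900):
        self.steps.append(f"cook for {seconds} seconds")


def make_me_pizza(dough="Plain", sauce="Marinara", toppings=("Cheese", "Pepperoni")):
    pizza = Pizza(dough)
    pizza.stretch_dough()
    if dough == "Garlic":
        pizza.steps.append("add garlic butter")
    pizza.add_sauce(sauce)
    pizza.add_toppings(toppings)
    pizza.cook()
    pizza.steps.append("serve")
    return pizza.steps


for step in make_me_pizza(dough="Garlic"):
    print(step)
```

Anything that had to be invented to make this run, such as what counts as "round enough" or how long to cook, is a place where two cooks following the pseudocode might diverge.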

# Exercise 4. What data is available for whom? 

Jeremy Singer-Vine has been compiling a list of really interesting data sets for several years. He shares these via his mailing list "Data Is Plural". The most recent version of this list is available [here on Google Sheets](https://docs.google.com/spreadsheets/d/1wZhPLMCHKJvwOkP4juclhjFgqIY8fQFMemwKL2c64vk/edit#gid=0). 

Browse through this list of data sets. Below are some questions to ask of any given row signifying a data set:

A. By viewing the summary of the data, give an example of a distribution that could be stored and summarised.

B. With this data, what is excluded? 
  - Would certain cases or classes of people/things be excluded that could be considered? 
  - Would other data about the existing cases be useful or interesting? 
  - Could we merge in data to compensate or would we need to do a separate data collection effort?
  - Would accessing this data be ethically reasonable for academic research?

## Answer 4. 

Please select a data set and answer the questions above. 

__Answer below here__:

From Cornell's "Movie Chatter" data (row 81), one could collect the list of 9,035 character names that appear in the movie conversations. 

1. One likely bias of this dataset is that it overrepresents male characters. The Bechdel test has been used in the past to show the concerning lack of dialogue between women that is not about men. While there may be many conversations between women about men in this dataset as well, it is likely that more male than female characters have dialogue in movies. 
2. Additional data about the characters could include the name of the actor or actress who played them. That could be used to analyze which actors and actresses get the most dialogue within the corpus. 
3. In this case, we would have to collect data separately in order to get actor and actress names. This could potentially be web-scraped from IMDb or taken from a possible database from the Academy.
4. I think accessing this data would be entirely ethical, as it is all public information and the actors and actresses have already chosen to put their names and likenesses into the public sphere.

__Answer above here__
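
As a rough illustration of questions A and B, the sketch below summarises a distribution and merges in separately collected data using only the standard library. Every row is made up for this example and is not taken from the corpus. The field names (`character`, `movie`, `gender`) and the `cast` lookup are assumptions, including the assumption that a gender field is available at all.

```python
from collections import Counter

# Made-up toy rows for illustration only; these are NOT taken from the corpus.
characters = [
    {"character": "ALICE", "movie": "Example Movie A", "gender": "f"},
    {"character": "BOB", "movie": "Example Movie A", "gender": "m"},
    {"character": "CARL", "movie": "Example Movie B", "gender": "m"},
    {"character": "DANA", "movie": "Example Movie B", "gender": "?"},
]

# Hypothetical separately collected lookup: (movie, character) -> actor name.
cast = {
    ("Example Movie A", "ALICE"): "Actor One",
    ("Example Movie A", "BOB"): "Actor Two",
}

# A. A distribution that can be stored and summarised: characters per gender.
gender_counts = Counter(row["gender"] for row in characters)
print(gender_counts)

# B. Merge in the extra data, keeping track of cases the lookup cannot cover.
merged = [
    {**row, "actor": cast.get((row["movie"], row["character"]))}
    for row in characters
]
unmatched = [row["character"] for row in merged if row["actor"] is None]
print(f"{len(unmatched)} of {len(merged)} characters have no actor match: {unmatched}")
```

Counting the unmatched rows after a merge is one simple way to see what the combined data still excludes.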