## Demo [35 minutes]
1. Demonstrate accessing the [Course GitHub Repo](https://github.com/pointOfive/STA130_ChatGPT) from Quercus and opening a [UofT JupyterHub](https://datatools.utoronto.ca) [or Google Colab] notebook
2. Demonstrate Jupyter Notebooks as a "python calculator"
3. Demonstrate Jupyter Notebooks as a ["Markdown editor"](http://markdownguide.org)
4. Demonstrate using [ChatGPT](https://chat.openai.com/) to find an amusing, funny, or otherwise interesting data set available online with missing values 
> Demonstrate a live interaction similar to this [helpful and productive ChatGPT script](https://chatgpt.com/share/03ab2b84-e4f7-41bf-bb3f-75e0c1298086) 
> - "handling/imputing/filling/assumptions regarding missing values" are "out of scope" for STA130 and should be noted as such
> - interactions can be [frustrating and annoying](https://chat.openai.com/share/b1b053eb-e168-40b9-9196-25f6dfcbf14b) if ChatGPT (a) is prompted too vaguely, or (b) is asked an overly complex, "overwhelming" question
> - ChatGPT is not a substitute for reviewing data yourself at a data set repository such as [TidyTuesday](https://github.com/rfordatascience/tidytuesday), but interactions with ChatGPT might help brainstorm ideas and provide a way to "search for content" (even within a specific website)
>
> This interaction will naturally introduce `import` and `pandas`, loading the data into your notebook, and assessing missing values in a data set
> - ChatGPT generally responds using `python` but the example above additionally primed ChatGPT to respond using pandas and an online url (as opposed to a local file)
> - The "data set urls" that ChatGPT provides are often not real and ChatGPT quickly forgets about the request for a data set with missingness (which is likely not information it has access to for many specific data sets), so both the data set and present missingness need to actually be checked and confirmed

5. Demonstrate saving the "ChatGPT log history" and sharing it with the students through a Piazza post and a Quercus announcement

6. Demonstrate saving your python jupyter notebook in your own account and "repo" on [github.com](https://github.com)
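
For step 4 above, where both the data set URL and its missingness need to be confirmed, the check could be wrapped in a small helper. This is only a sketch under stated assumptions: `load_and_count_missing` is a hypothetical name (not part of the course materials), and the inline CSV text is made-up stand-in data so that the example runs without internet access.

```python
import io
import pandas as pd

def load_and_count_missing(source):
    """Try to read a CSV; return per-column missing counts, or None if loading fails."""
    try:
        df = pd.read_csv(source)
    except (OSError, pd.errors.ParserError, pd.errors.EmptyDataError) as err:
        # Unreachable URLs and missing files typically surface as OSError subclasses
        print(f"Could not load data: {type(err).__name__}: {err}")
        return None
    return df.isnull().sum()

# Made-up stand-in data (an assumption for illustration): the empty field becomes NaN
fake_csv = io.StringIO("name,age\nAda,36\nGrace,\n")
missing_counts = load_and_count_missing(fake_csv)
print(missing_counts)

# A source that does not exist is reported instead of stopping the notebook
bad_result = load_and_count_missing("no_such_file_for_this_demo.csv")
```

In a live demo, the URL suggested by ChatGPT would be passed in place of `fake_csv`.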

## Communication
        
1. In 8 groups of 3, or thereabouts...
    1. **[15 minutes]** Ice breakers / introductions
        1. Each person may bring two emojis to a desert island... reveal your emojis at the same time... for emojis selected more than once the group should select one additional emoji
        2. Where are you from, what do you think your major might be, and what's an "interesting" fact that you're willing to share about yourself?
        
    2. **[10 minutes]** These are where all the bullet holes were in the planes that returned home so far after some missions in World War II
        1. Where would you like to add armour to planes for future missions?
        2. Hint: there is a hypothetical data set of the bullet holes on the planes that didn't return which is what we'd ideally compare the data set we observe to...
        
           ![](https://upload.wikimedia.org/wikipedia/commons/thumb/b/b2/Survivorship-bias.svg/640px-Survivorship-bias.svg.png)
           
    3. **[10 minutes]** Monty Hall problem: there are three doors, one of which has a prize. You select one, and the gameshow host reveals that one of the other two doors does not have the prize... would you like to change your guess to the other unknown door?

       ![](https://mathematicalmysteries.org/wp-content/uploads/2021/12/04615-0sxvwbnzvvnhuklug.png) 
    
2. **[30 minutes]** Review the experience of the groups for the WW2 planes and Monty Hall problems
    1. For each of these problems, have students vote on whether their groups (a) agreed on answers from the beginning, (b) agreed only after some discussion and convincing, or (c) remained somewhat divided in their opinions of the best way to proceed
    2. Briefly discuss a few of the general answers the groups arrived at
    3. Prompt ChatGPT to introduce and explain "survivorship bias" using spotify songs as an example (like [this](https://chatgpt.com/c/25ea6a5c-f408-4d81-8951-f14e10678d6d) or you could instead try to approach things [more generally](https://chat.openai.com/share/6a69b83f-4af7-4de1-8e35-294decb926e7)) and see if students are able to generalize this idea for the WW2 planes problem and if they find it to be a convincing argument to understand the problem
    4. Prompt ChatGPT to introduce and explain the Monty Hall problem and see if the students find it understandable and convincing
        1. Demonstrate that [ChatGPT wrongly calculates the probability as 1/2 when asked to express its answer probabilistically](https://chat.openai.com/share/26f74c54-5358-431e-b97a-315fdcb4e2c9), showing that there are clear limits to how deeply ChatGPT is able to "reason" and that information that ChatGPT uses to create responses can be wrong
        
        2. Discuss how the nature of prompting and follow up questions influences the nature of the conversation and the degree to which the discussion can be directed and extended 
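
If the group discussion or ChatGPT's explanation leaves students unconvinced, a quick simulation can provide an empirical check on the three-door problem. The following is a minimal sketch rather than part of the original materials: the function names, the number of games, and the random seed are all arbitrary choices made for illustration.

```python
import random

def play_one_game(switch, rng):
    """Play one game of the three-door problem; return True if the prize is won."""
    doors = [0, 1, 2]
    prize = rng.choice(doors)
    first_pick = rng.choice(doors)
    # The host opens a door that is neither the contestant's pick nor the prize
    host_opens = rng.choice([d for d in doors if d != first_pick and d != prize])
    if switch:
        # Switch to the one door that is neither the first pick nor the opened door
        final_pick = next(d for d in doors if d != first_pick and d != host_opens)
    else:
        final_pick = first_pick
    return final_pick == prize

def win_rate(switch, n_games=100_000, seed=130):
    """Estimate the proportion of games won under a fixed strategy."""
    rng = random.Random(seed)  # fixed seed (arbitrary) so reruns give the same numbers
    wins = sum(play_one_game(switch, rng) for _ in range(n_games))
    return wins / n_games

stay_rate = win_rate(switch=False)
switch_rate = win_rate(switch=True)
print(f"Win rate when staying:   {stay_rate:.3f}")
print(f"Win rate when switching: {switch_rate:.3f}")
```

With this many games, the two estimated win rates should land close to 1/3 for staying and 2/3 for switching, which can be contrasted with the 1/2 answer in the linked ChatGPT transcript.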

## Homework [0 minutes]

> Code and write all your answers (for both the "Prelecture" and "Postlecture" HW) in a python notebook (in code and markdown cells) and include the link(s) to your ChatGPT transcript log at the top of your notebook. Save your python jupyter notebook in your own account and "repo" on [github.com](https://github.com) and submit a link to that notebook through Quercus for assignment marking. 
> 
> *The marking rubric (which may award partial credit) is as follows:*
> - [0.1 points]: All ChatGPT transcript logs are reported at the top of the notebook
> - [0.2 points]: Reasonable general definitions for "2.B"
> - [0.2 points]: Demonstrated understanding regarding "4"
> - [0.2 points]: A sensible justification for the choice in "7.D"
> - [0.3 points]: Requested assessment of ChatGPT and google in "8"



### Prelecture HW

1. Pick one of the data sets from your **TUT demo** (or the [example demo](https://chatgpt.com/share/03ab2b84-e4f7-41bf-bb3f-75e0c1298086) or your own version of the ChatGPT session if you wish) and use the code from the ChatGPT interaction to import the data and confirm that the data set has missing values

```python
# For example, based on the "example demo" you might use
import pandas as pd
url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-05/villagers.csv"
df = pd.read_csv(url)
print(df.isnull().sum()) 
```

2. Start a new ChatGPT session with an initial prompt introducing the data set you're using and a request for help analyzing the data
    > Something like "I've downloaded a data set about characters from Animal Crossing (from https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-05/villagers.csv), and I'd like to get your help analyzing some things about this data" would work
    > - ChatGPT will allow you to upload your data, but if you do so [ChatGPT will end the session without providing any analysis unless you purchase an upgrade of ChatGPT](https://chatgpt.com/share/87305415-b2ea-4621-bfc2-00dc59dcaf2c), so instead respond by telling ChatGPT, "I've downloaded the data and want to understand the size (or dimensions) of the data set." 
    > > For this class we don't want ChatGPT to just do the analysis for us. We instead want ChatGPT to help us understand how to analyze the data ourselves. Therefore, **DO NOT purchase an upgrade of ChatGPT**. 

    1. Use code provided by ChatGPT in response to the prompts suggested above to print out the number of rows and columns of the data set 
    2. Write your own general definitions of the meaning of "observations" and "variables" based on asking ChatGPT to explain these terms in the context of your data set  
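
As a rough illustration of the kind of code ChatGPT might suggest for the size of a data set, here is a minimal sketch. The small data frame is made-up toy data (an assumption so the example runs without a download); with your own data, `df` would come from `pd.read_csv(url)` as in question 1.

```python
import pandas as pd

# Made-up toy data, used only so the example runs without a download
df = pd.DataFrame({
    "name": ["A", "B", "C"],
    "species": ["cat", "dog", "frog"],
    "favourite_song": ["Song 1", None, "Song 2"],
    "age": [3, 5, 2],
})

# df.shape is a (number of rows, number of columns) tuple
n_rows, n_cols = df.shape
print(f"{n_rows} rows and {n_cols} columns")
```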


3. Ask ChatGPT how to go about summarizing the data set and use the code suggested by ChatGPT to provide these summaries for your data set.
    > ChatGPT may provide you with a wide variety of approaches and techniques for summarizing your data set. We're looking for you to use `df.describe()` and give a few examples of `df['column'].value_counts()`, so re-prompting ChatGPT with something like "What's the simplest form of summarization of this data set that I could do and how do I do it in Python?" could be helpful 
    > - Hint 1: consider dividing the code that ChatGPT provides you into different jupyter notebook cells so that each cell concludes with a key printed result
    > - Hint 2: the last line of code in a jupyter notebook cell will automatically print out in a formatted manner, so replacing something like `print(df.head())` with `df.head()` as the last line of a cell provides a sensible way to organize your code
    > - Hint 3: jupyter notebook printouts usually don't show all of the data (when there's too much to show, like if `df.describe()` includes results for many columns); the printouts show just enough of the data to give an idea of what the results are, which is all we're looking for at the moment
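
The two summaries named above can be sketched as follows. This is an illustrative example on made-up toy data (the column names and values are assumptions), not output from any course data set.

```python
import pandas as pd

# Made-up toy data with one text column and two numeric columns
df = pd.DataFrame({
    "species": ["cat", "dog", "cat", "frog", "cat"],
    "age": [3, 5, 2, 1, 4],
    "weight": [4.0, 9.5, 3.5, 0.2, 5.0],
})

# Summary statistics; by default only numeric columns are included
summary = df.describe()

# How often each distinct value appears in a single column
species_counts = df["species"].value_counts()

print(summary)
print(species_counts)
```

In a notebook, placing `summary` and `species_counts` as the last line of separate cells (as in Hint 2) would give the formatted display instead of the `print` output.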
    
                
4. If the data set you're using has missing values in numeric variables, explain (perhaps using help from ChatGPT if needed) the discrepancies between size of the data set given by `df.shape` and what is reported by `df.describe()` with respect to (a) the number of columns it analyzes and (b) the values it reports in the "count" column
    > If the data set you're using does not have missing values in numeric variables (which is most likely the case), instead download and use the "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv" data to answer this question  
    > - Hint: In (a) above, the "columns it analyzes" are the columns of the output of `df.describe()`, but you can see all the columns in a data set using `df.columns`
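
To see the two discrepancies side by side before explaining them, the relevant outputs can be printed together. The sketch below uses made-up toy data (an assumption for illustration); the explanation of why the numbers differ is left to you.

```python
import pandas as pd

# Made-up toy data: one text column and two numeric columns with missing values
df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "age": [22.0, None, 35.0, None],
    "fare": [7.25, 71.28, None, 8.05],
})

shape = df.shape
summary = df.describe()

print("df.shape:", shape)
print("all columns:     ", list(df.columns))
print("describe columns:", list(summary.columns))
print(summary.loc["count"])  # the "count" row, one value per analyzed column
```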
    
    
5. Use ChatGPT to help understand the difference between an "attribute" (such as `df.shape` which does not end with `()`) and a "method" (such as `df.describe()` which does end with `()`) and provide your own paraphrasing summarization of that difference
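
One way to explore this difference hands-on is to look at what each expression gives back, with and without parentheses. This is a small sketch on made-up toy data, intended as a starting point for your own paraphrase rather than a full explanation.

```python
import pandas as pd

# Made-up toy data
df = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})

shape_value = df.shape            # attribute: accessed without parentheses
describe_method = df.describe     # without (), this refers to the method itself
describe_result = df.describe()   # with (), the method runs and returns a result

print(shape_value, type(shape_value).__name__)
print(callable(shape_value), callable(describe_method))
print(type(describe_result).__name__)
```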

    
    
6. Besides 'count', `df.describe()` provides the 'mean', 'std', 'min', '25%', '50%', '75%', and 'max' summary statistics for each variable it analyzes. Define (perhaps using help from ChatGPT if needed) the meaning of each of these summary statistics 
    > Hint: the answers here should make it obvious why these can only be calculated for numeric variables in a data set, which explains the answer to "4(a)" above
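
To check your definitions, each statistic can be recomputed individually and compared against `describe()`. The sketch below does this for a single made-up numeric column (the values are an assumption chosen to keep the arithmetic easy to follow).

```python
import pandas as pd

# Made-up values, chosen so the statistics are easy to verify by hand
values = pd.Series([1, 2, 3, 4, 5], name="x")
summary = values.describe()

by_hand = {
    "count": values.count(),                 # number of non-missing values
    "mean": values.sum() / values.count(),   # arithmetic average
    "std": values.std(),                     # sample standard deviation (divides by n - 1)
    "min": values.min(),
    "25%": values.quantile(0.25),            # first quartile
    "50%": values.median(),                  # median (second quartile)
    "75%": values.quantile(0.75),            # third quartile
    "max": values.max(),
}

for name, value in by_hand.items():
    print(f"{name:>5}: {float(value):.3f}  (describe gives {float(summary[name]):.3f})")
```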
    

7. Missing data can be considered "across" rows or "down" columns.  Consider (perhaps using help from ChatGPT if needed) how `df.dropna()` or `del df['col']` should be applied to most efficiently use the available data

    > More sophisticated analyses for "filling in" rather than removing missing data (as considered here) are possible (based on making assumptions about missing data and using specific imputation methods or models) but these are "beyond the scope" of STA130 so this can be safely ignored for now
    > - This question is not interested in the general benefits of imputing missing data, or the general benefits of using `df.dropna()` and/or `del df['col']` to remove missing data, just how to most efficiently remove missing data if a user chooses to do so

    1. Provide an example of a "use case" in which using `df.dropna()` might be preferred over using `del df['col']`
    2. Provide an example of "the opposite use case" in which using `del df['col']` might be preferred over using `df.dropna()` 
    3. Discuss why applying `del df['col']` before `df.dropna()` when both are used together would be important
    4. Remove all missing data from one of the data sets you're considering using some combination of `del df['col']` and/or `df.dropna()` and give a justification for your approach
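
When experimenting with the two removal approaches, it may help to compare how many rows survive each one. The sketch below uses made-up toy data (the column names and values are assumptions) and works on copies so the original data frame is left untouched; the justification for which approach suits your data is still yours to write.

```python
import pandas as pd

# Made-up toy data: "deck" is mostly missing, "age" has a single missing value
df = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e"],
    "age": [22.0, None, 35.0, 41.0, 19.0],
    "deck": [None, None, "C", None, None],
})

# Option 1: drop every row that has any missing value
dropped_rows_only = df.dropna()

# Option 2: delete one column first, then drop the remaining incomplete rows
trimmed = df.copy()       # work on a copy so df itself is not modified
del trimmed["deck"]
column_then_rows = trimmed.dropna()

print("rows kept by dropna() alone:          ", len(dropped_rows_only))
print("rows kept by del column, then dropna():", len(column_then_rows))
```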


### Postlecture HW

8. This problem will guide you through exploring how to use ChatGPT to provide and troubleshoot code using the "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv" data set.
    > To limit the scope of the responses from ChatGPT, start a new ChatGPT session with the following slight variation on the initial prompting approach from "2" above
    > - "I am going to do some initial simple summary analyses on the titanic data set (https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv) which has some missing values, and I'd like to get your help understanding the code I'm using and the analysis it's performing"
    > 
    > > Remember, ChatGPT will allow you to upload your data, but for this class we don't want ChatGPT to just do the analysis for us. **DO NOT purchase an upgrade of ChatGPT**. We instead want ChatGPT to help us understand how to analyze the data ourselves. 
    
    1. Use ChatGPT to understand what `df.groupby("col1")["col2"].describe()` does and then demonstrate and explain this using a different example from your data set that ChatGPT doesn't provide for you
        1. If ChatGPT doesn't address how missing values are handled with this code, extend the conversation by asking "How are missing values handled here?"
        > Here's an [example](https://chatgpt.com/share/354faeef-719a-415f-964c-b2fd18dead93) of how you could get this session started
    
    2. Intentionally introduce the following errors into your code and report your opinion as to whether it's easier to (a) work with ChatGPT to fix the errors, or (b) use google to search for and fix errors 
    > First share the errors you get with ChatGPT and see if you can work with ChatGPT to troubleshoot and fix the coding errors, and then see if you think a google search for the error provides the necessary troubleshooting help more quickly than ChatGPT
    
        1. Forget to include `import pandas as pd` in your code 
        > Use Kernel->Restart from the notebook menu to restart the jupyter notebook session, unload imported libraries, and start over so you can create this error
        >  - When python has an error, it sometimes provides a lot of "stack trace" output, but that's not usually very important for troubleshooting. For this problem for example, all you need to share with ChatGPT or search on google is `"NameError: name 'pd' is not defined"`

        2. Mistype "titanic.csv" as "titanics.csv" 
            1. If ChatGPT troubleshooting is based on downloading the file, just replace the whole url with "titanics.csv" and try to troubleshoot the subsequent `FileNotFoundError: [Errno 2] No such file or directory: 'titanics.csv'` (assuming the file is indeed not present)
            2. Explore introducing typos into a couple other parts of the url and note the slightly different errors this produces
      
        3. Try to use a dataframe before it's been assigned into the variable
        > You can simulate this by just misnaming the variable. For example, if you should write `df.groupby("col1")["col2"].describe()` based on how you loaded the data, then instead write `DF.groupby("col1")["col2"].describe()`
        > - Make sure you've fixed your file name so that's not the error any more
        
        4. Forget one of the parentheses somewhere in the code
        > For example, if the code should be `pd.read_csv(url)` then change it to `pd.read_csv(url` 
        
        5. Mistype one of the names of the chained functions within the code 
        > For example, try something like `df.group_by("col1")["col2"].describe()` and `df.groupby("col1")["col2"].describle()`
        
        6. Use a column name that's not in your data for the `groupby` and column selection 
        > For example, try capitalizing a column name, such as replacing "sex" with "Sex" in `titanic_df.groupby("sex")["age"].describe()`, and then instead introducing the same error for "age"
        
        7. Forget to put the column name as a string in quotes for the `groupby` and column selection, and see if ChatGPT and google are still as helpful as they were for the previous question
        > For example, something like `titanic_df.groupby(sex)["age"].describe()`, and then `titanic_df.groupby("sex")[age].describe()`
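
For reference while working through question 8, the sketch below shows the grouped summary on a tiny made-up stand-in for the titanic data (the values are assumptions, not real passenger records) and then triggers a few of the errors above on purpose. The helper `short_error_message` is a hypothetical name introduced here; it captures only the final "ErrorType: message" text, which is the part worth sharing with ChatGPT or a search engine.

```python
import pandas as pd

# Tiny made-up stand-in for the titanic data, including one missing age
titanic_df = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "female"],
    "age": [22.0, 38.0, None, 35.0, 27.0],
})

# Summary statistics of "age" computed separately for each value of "sex"
grouped = titanic_df.groupby("sex")["age"].describe()
print(grouped)

def short_error_message(action):
    """Run `action`; return 'ErrorType: message' if it fails, or None if it succeeds."""
    try:
        action()
    except Exception as err:
        return f"{type(err).__name__}: {err}"
    return None

# Each lambda delays the faulty code so the error occurs inside the helper
misspelled_method = short_error_message(lambda: titanic_df.group_by("sex"))
wrong_column = short_error_message(lambda: titanic_df.groupby("Sex")["age"].describe())
undefined_name = short_error_message(lambda: DF.groupby("sex"))

for message in (misspelled_method, wrong_column, undefined_name):
    print(message)
```

A missing parenthesis cannot be caught this way, because Python reports a `SyntaxError` before any of the cell's code runs.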
    
9. Assuming the code you've been using in the previous problem did not explicitly remove the missing values from the data set in the manner of question "7" from the "Prelecture HW", what do you think the "pros" and "cons" are of not explicitly removing the missing values from the data set? 

> You may want to review the response ChatGPT provided for the prompt "How are missing values handled here?" from question "8.A.a" above.

10. Have you interacted with ChatGPT to help you understand all the material in the tutorial and lecture that you didn't quite follow when you first saw it? 
> Just answering "Yes" or "No" or "Somewhat" or "Mostly" or whatever here is fine as this question isn't a part of the rubric; but, the midterm and final exams may ask questions that are based on the tutorial and lecture materials; and, your own skills will be limited by your familiarity with these materials (which will determine your ability to actually do actual things effectively with these skills... like the course project...)

#### Afterward

Here are a few ideas of some other kinds of interactions you might consider exploring with ChatGPT...

> While these are likely to be extremely practically valuable, they are not a part of the homework assignment, so do not include anything related to these in your homework submission

- With respect to improving one's ability in statistics, coding, communication, and other key data science skills
    - what is ChatGPT's perception of its own capabilities and uses as an AI-driven assistance tool?
    - and does ChatGPT's assessment of itself influence or agree with your own evaluation of ChatGPT? 

- ChatGPT can introduce and explain the "World War 2 planes" problem and the "Monty Hall" problem... 
    - how well does it seem to do at showing and explaining other "unintuitive surprising statistics paradoxes"?

- Consider the process of writing about why you chose to take this course and the skills you were hoping to build through it, with respect to your current ideas about possible careers...
    - how do you think the exercise would be different if you framed it as a dialogue with ChatGPT?
    - and do you think the difference could be positive and productive, or potentially biasing and distracting?
    
- ChatGPT sometimes immediately responds in simple helpful ways, but other times it gives a lot of information that can be overwhelming... are you able to prompt and interact with ChatGPT in a manner that keeps its responses helpful and focused on what you're interested in? 

- ChatGPT tends to respond in a fairly empathetic and supportive tone...
    - do you find it helpful to discuss concerns you might have about succeeding in the course (or entering university more generally) with ChatGPT?
    
- For what purposes and in what contexts do you think ChatGPT could provide suggestions or feedback about your experiences that might be useful? 
