# PANDAS

A high-level overview of the [Pandas](https://pandas.pydata.org) library.


## Why `pandas`?

`pandas` is a Python library for data manipulation and analysis. It is an industrial-strength package used in most real-world data analysis projects. Learning `pandas` will also make your projects easier for other data scientists to understand, which extends the influence your projects can have.



In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

plt.style.use('fivethirtyeight')
#sns.set_context("notebook")

![alt text](pandas_illustration.jpeg)

# Series

A "series" is the building block of pandas data. It acts somewhat like a "dictionary" (a built-in Python type): each value is stored under a label. Series are very useful. In Pandas, the "index" is the set of labels used to identify things.


In [2]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [3]:
data['b']

0.5

In [4]:
'a' in data

True

In [5]:
data.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

Extending/adding to a series:

In [6]:
data['e'] = 1.25
data

a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64

We made a series above. Now let's make a simple dataframe with several columns.

In [7]:

#There are many ways to make data frames.  Here I use the python "dictionary" type
# indicated by the curly braces {}.  Each column is named by a "key", and the value for that key
# is a list [] holding the column's entries.
myDictionary = {
    "Key": [1,2],
    "Author": ["Dr Seuss","Stephen King"], 
    "Book Title": ["cat in the hat","It"]
}

df = pd.DataFrame( myDictionary )

df

Unnamed: 0,Key,Author,Book Title
0,1,Dr Seuss,cat in the hat
1,2,Stephen King,It


## Reading in DataFrames from Files

Pandas has a number of very useful file reading tools. This link describes many of them: https://realpython.com/pandas-read-write-files/ Today we'll be using `read_csv`. Another very useful one is the ability to read Excel files.


A "csv" file is a "comma separated value" file.  It's a nice and simple text format that separates things in the files by commas.  For example:
Participant,ResponseTime
1,0.50
2,.0386

This is a fairly common file format that can be read by almost every program (e.g. excel, SPSS, python, R)
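
As a minimal sketch of what `read_csv` does with text like the example above, the snippet below parses the same three lines from an in-memory string rather than from a file. The use of `io.StringIO` and the name `df_small` are my own choices for illustration and are not part of the original notebook.

```python
import io
import pandas as pd

# The same CSV text as the example above, held in a string instead of a file.
csv_text = "Participant,ResponseTime\n1,0.50\n2,.0386\n"

# read_csv accepts any file-like object, so StringIO stands in for a file here.
df_small = pd.read_csv(io.StringIO(csv_text))

print(df_small)
print(df_small.dtypes)  # the first line became the column names; values were parsed as numbers
```

The first line of the text is used for the column names, and each later line becomes one row.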


Pandas stores things in something known as a "dataframe". 


In [8]:
elections = pd.read_csv("elections.csv")
elections # if we end a cell with an expression or variable name, the result will print

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss
3,Reagan,Republican,58.8,1984,win
4,Mondale,Democratic,37.6,1984,loss
5,Bush,Republican,53.4,1988,win
6,Dukakis,Democratic,45.6,1988,loss
7,Clinton,Democratic,43.0,1992,win
8,Bush,Republican,37.4,1992,loss
9,Perot,Independent,18.9,1992,loss


We can use `shape` to get information about the shape of this dataset.

In [9]:
elections.shape

(25, 5)

We can use the head command to return only a few rows of a dataframe.

In [10]:
elections.head(10)

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss
3,Reagan,Republican,58.8,1984,win
4,Mondale,Democratic,37.6,1984,loss
5,Bush,Republican,53.4,1988,win
6,Dukakis,Democratic,45.6,1988,loss
7,Clinton,Democratic,43.0,1992,win
8,Bush,Republican,37.4,1992,loss
9,Perot,Independent,18.9,1992,loss


There is also a tail command.

In [11]:
elections.tail(7)

Unnamed: 0,Candidate,Party,%,Year,Result
18,McCain,Republican,45.7,2008,loss
19,Obama,Democratic,51.1,2012,win
20,Romney,Republican,47.2,2012,loss
21,Clinton,Democratic,48.2,2016,loss
22,Trump,Republican,46.1,2016,win
23,Biden,Democratic,51.3,2020,win
24,Trump,Republican,46.8,2020,loss


When reading data, column names are ideally unique. If we read in a file whose column names are not unique, Pandas will automatically rename any duplicates. This is good to know, because many datasets in the wild have duplicate names.

In [12]:
dups = pd.read_csv("duplicate_columns.csv")
dups

Unnamed: 0,name,name.1,flavor
0,john,smith,vanilla
1,zhang,shan,chocolate
2,fulan,alfulani,
3,hong,gildong,banana
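
To make the renaming easier to see without the file on disk, here is a small sketch that feeds `read_csv` a string whose header repeats the word `name`. The CSV text and the variable `dups_demo` are made up for this illustration; the real `duplicate_columns.csv` may differ.

```python
import io
import pandas as pd

# Hypothetical file contents: two columns share the header "name".
csv_text = "name,name,flavor\njohn,smith,vanilla\nzhang,shan,chocolate\n"

dups_demo = pd.read_csv(io.StringIO(csv_text))

# The second "name" column is renamed with a numeric suffix.
print(list(dups_demo.columns))
```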


## Indexing, Slicing, Dicing

After reading in data, the most common operation is selecting data. With pandas dataframes there are a bunch of powerful ways to access data. I'll step through a few now.

The DataFrame class has an indexing operator [] that lets you do a variety of different things. If you provide a string to the [] operator, you get back a Series corresponding to the requested label.

This is the start of where the syntax will get a bit confusing.

### Selection Using Label/Index, with `loc`

**Column Selection** 

To select a column of a `DataFrame` by column label, the safest and fastest way is to use the `.loc` [method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html). General usage of `.loc` looks like `df.loc[rowname, colname]`. Remember that the colon `:` means "everything." For example, if we want the `color` column of the `ex` DataFrame, we would use: `ex.loc[:, 'color']`

- You can also slice across columns. For example, `baby_names.loc[:, 'Name':]` would select the column `Name` and all columns after `Name`.

- *Alternative:* While `.loc` is invaluable when writing production code, you will often see an alternative, the `[]` operator, which takes the form `df['colname']`. I advise against using this until you are more comfortable with dataframes. It has a few idiosyncrasies that make it confusing.

**Row Selection**

Similarly, if we want to select a row by its label, we can use the same `.loc` method. In this case, the "label" of each row refers to the index (i.e., primary key) of the DataFrame.

### We will go through a bunch of examples now.


In [41]:
#Show the first 7 rows (labels 0 through 6; .loc slices include the end label).
elections.loc[0:6]

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss
3,Reagan,Republican,58.8,1984,win
4,Mondale,Democratic,37.6,1984,loss
5,Bush,Republican,53.4,1988,win
6,Dukakis,Democratic,45.6,1988,loss


In [52]:
#Show the party column
# Note that the ":" means everything
elections.loc[:,'Party']

0      Republican
1      Democratic
2     Independent
3      Republican
4      Democratic
5      Republican
6      Democratic
7      Democratic
8      Republican
9     Independent
10     Democratic
11     Republican
12    Independent
13     Democratic
14     Republican
15     Democratic
16     Republican
17     Democratic
18     Republican
19     Democratic
20     Republican
21     Democratic
22     Republican
23     Democratic
24     Republican
Name: Party, dtype: object

In [43]:
# Show just the Candidate names for the first 7 rows (labels 0 through 6).
elections.loc[0:6,'Candidate']


0      Reagan
1      Carter
2    Anderson
3      Reagan
4     Mondale
5        Bush
6     Dukakis
Name: Candidate, dtype: object

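The .loc[] operator also accepts a list of labels. With `:` for the rows and a list of column names, you get back a DataFrame containing the requested columns. With a single row label, as in the next cell, you get back a Series holding that row's values for the requested columns.

In [48]:
#Select two columns for the row labeled 1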
elections.loc[1,["Candidate", "Party"]]

Candidate        Carter
Party        Democratic
Name: 1, dtype: object

## [] Shorthand
You will often see people skip the .loc syntax of df.loc[] and use syntax like: df[].

The [] operator also accepts numerical slices as arguments. In this case, we are indexing by row, not column!

## Which can get really, really confusing!!

In [16]:
elections[0:3]

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss


The way to think of this is that the table is fundamentally a table with rows and columns.  Columns have names, rows have numbers by default. What we did above was shorthand.  When we didn't ask for a specific column or row we got all of them back.  

When you start selecting both, there are a lot of [] to keep track of.

In [17]:
elections[["Candidate","Party"]][0:3]

Unnamed: 0,Candidate,Party
0,Reagan,Republican
1,Carter,Democratic
2,Anderson,Independent


If you provide a single argument to the [] operator, it tries to use it as a name. This is true even if the argument passed to [] is an integer.  The next cell has an intentional error.   You will see these "KeyError" messages often when working with pandas.   It just means it can't find what you're looking for.  Usually because of a typo. 

In [21]:
#Since the code above with the range 0:3 worked, you might expect to be able to pick a single value.
# That would be wrong.  The following code fails with a "KeyError".

elections[["Candidate","Party"]][1]  #Incorrect code that will create an error.

KeyError: 1

The moral of these errors is to stick with .loc[] for clarity.  
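
The three behaviors of `[]` described above can be put side by side on a toy table. This is only a sketch: the frame `toy` is a hand-typed stand-in for the first three rows of the elections data, and the variable names are mine.

```python
import pandas as pd

# Toy stand-in for the first three rows of the elections table.
toy = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson"],
    "Party": ["Republican", "Democratic", "Independent"],
})

by_name = toy["Candidate"]   # a string is treated as a column label -> Series
by_slice = toy[0:2]          # a slice is treated as row positions -> DataFrame

# A bare integer is treated as a column label, and no column is named 1.
try:
    toy[1]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# With .loc the row and column roles are explicit.
row_one = toy.loc[1]                 # the row labeled 1 -> Series
one_cell = toy.loc[1, "Candidate"]   # a single value

print(lookup_failed, one_cell)
```

With `.loc`, the first position always means rows and the second always means columns, so there is nothing to guess.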

# Strings VS Numbers

Labels for rows and columns can be strings and/or numbers.  Another common confusion is that the number 1 is treated as **different** from the string "1".

Yes.  This can be annoying.  

In [53]:
weird = pd.DataFrame({
    1:["topdog","botdog"], 
    "1":["topcat","botcat"]
})
weird

Unnamed: 0,1,1.1
0,topdog,topcat
1,botdog,botcat


In [None]:
weird[1] #try to predict the output

In [None]:
weird["1"] #try to predict the output

In [None]:
weird[1:] #try to predict the output

#Yes

## Boolean Array Selection

Now let's start doing some more interesting things. 

The `.loc[]` operator also supports an array of booleans as input. In this case, the array must be exactly as long as the number of rows. The result is a filtered version of the data frame, where only rows corresponding to True appear.

In [None]:
elections

In [57]:
elections.loc[[False, False, False, False, False, 
          False, False, True, False, False,
          True, False, False, False, True,
          False, False, False, False, False,
          False, False, True, True, True]]

Unnamed: 0,Candidate,Party,%,Year,Result
7,Clinton,Democratic,43.0,1992,win
10,Clinton,Democratic,49.2,1996,win
14,Bush,Republican,47.9,2000,win
22,Trump,Republican,46.1,2016,win
23,Biden,Democratic,51.3,2020,win
24,Trump,Republican,46.8,2020,loss


One very common task in data science is filtering. Boolean array selection is one way to achieve this in Pandas. We start by observing that logical operators like the equality operator can be applied to a Pandas Series to generate a boolean array. For example, we can compare the 'Result' column to the string 'win':

In [None]:
elections

In [69]:
iswin = elections.loc[:,'Result'] == 'win'
iswin.head(5)

0     True
1    False
2    False
3     True
4    False
Name: Result, dtype: bool

The output of the logical operator applied to the Series is another Series with the same name and index, but of datatype boolean. The entry at row #i represents the result of the application of that operator to the entry of the original Series at row #i.

Such a boolean Series can be used as an argument to `.loc[]` (and to the `[]` operator). For example, the following code creates a DataFrame of all election winners since 1980.

In [70]:
elections.loc[iswin]

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
3,Reagan,Republican,58.8,1984,win
5,Bush,Republican,53.4,1988,win
7,Clinton,Democratic,43.0,1992,win
10,Clinton,Democratic,49.2,1996,win
14,Bush,Republican,47.9,2000,win
16,Bush,Republican,50.7,2004,win
17,Obama,Democratic,52.9,2008,win
19,Obama,Democratic,51.1,2012,win
22,Trump,Republican,46.1,2016,win


Above, we've assigned the result of the logical operator to a new variable called `iswin`. This is uncommon. Usually, the series is created and used on the same line. 

This syntax is a little tricky to read at first, but you'll get used to it quickly.

In [71]:
elections.loc[elections.loc[:,'Result'] == 'win']

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
3,Reagan,Republican,58.8,1984,win
5,Bush,Republican,53.4,1988,win
7,Clinton,Democratic,43.0,1992,win
10,Clinton,Democratic,49.2,1996,win
14,Bush,Republican,47.9,2000,win
16,Bush,Republican,50.7,2004,win
17,Obama,Democratic,52.9,2008,win
19,Obama,Democratic,51.1,2012,win
22,Trump,Republican,46.1,2016,win


We can select multiple criteria by creating multiple boolean Series and combining them using the `&` operator.

In [72]:
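# Note: despite the name win50plus, the condition below keeps winners with LESS than 50% of the vote.
win50plus = (elections.loc[:,'Result'] == 'win') & (elections.loc[:,'%'] < 50)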

In [73]:
win50plus.head(5)

0    False
1    False
2    False
3    False
4    False
dtype: bool

Here I'll slip into the `[]` mode instead of the `.loc[]` mode so you can see what it looks like.

In [75]:
elections[win50plus]


Unnamed: 0,Candidate,Party,%,Year,Result
7,Clinton,Democratic,43.0,1992,win
10,Clinton,Democratic,49.2,1996,win
14,Bush,Republican,47.9,2000,win
22,Trump,Republican,46.1,2016,win


The | operator is the symbol for or.

In [None]:
elections[(elections['Party'] == 'Republican')
          | (elections['Party'] == "Democratic")]

If we have multiple conditions (say Republican or Democratic), we can use the `isin` method to simplify our code.

In [None]:
elections['Party'].isin(["Republican", "Democratic"])

In [None]:
elections[elections['Party'].isin(["Republican", "Democratic"])]
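
The operators from this section can be combined on a small table to check your understanding. The sketch below uses a hand-typed four-row frame called `toy` (my own stand-in, not the real file) and also shows `~`, which negates a boolean Series; `~` is not used elsewhere in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Clinton"],
    "Party": ["Republican", "Democratic", "Independent", "Democratic"],
    "%": [50.7, 41.0, 6.6, 43.0],
    "Result": ["win", "loss", "loss", "win"],
})

is_win = toy["Result"] == "win"
under_50 = toy["%"] < 50

# Parentheses matter: & and | bind more tightly than == and <.
close_wins = toy[(toy["Result"] == "win") & (toy["%"] < 50)]

# | keeps rows where at least one condition is True.
either = toy[is_win | under_50]

# isin builds one boolean Series from a list of allowed values; ~ flips it.
major = toy["Party"].isin(["Republican", "Democratic"])
third_party = toy[~major]

print(close_wins["Candidate"].tolist())
print(third_party["Candidate"].tolist())
```

Leaving out the parentheses around each comparison is a common source of confusing errors, so it is worth keeping them even when they look redundant.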

# Query()

An alternative, simpler way to get back a specific set of rows is to use the `query` method. Many people find queries more natural and easier to read.

In [76]:
elections.query?

Signature: elections.query(expr: 'str', *, inplace: 'bool' = False, **kwargs) -> 'DataFrame | None'
Docstring:
Query the columns of a DataFrame with a boolean expression.

Parameters
----------
expr : str
    The query string to evaluate.

    You can refer to variables
    in the environment by prefixing them with an '@' character like
    ``@a + b``.

    You can refer to column names that are not valid Python variable names
    by surrounding them in backticks. Thus, column names containing spaces
    or punctuations (besides underscores) or starting with digits must be
    surrounded by backticks. (For example, a column named "Area (cm^2)" would
    be referenced as ```Area (cm^2)```). Column names which are Python keyword

In [77]:
elections.query("Result == 'win' and Year < 2000")

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
3,Reagan,Republican,58.8,1984,win
5,Bush,Republican,53.4,1988,win
7,Clinton,Democratic,43.0,1992,win
10,Clinton,Democratic,49.2,1996,win


## Warning here.  


We didn't do it above.  But it's possible to use names for rows as well as columns.  

Note: The `loc` command won't work with numeric arguments if we're using a dataframe that has labeled rows instead.
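
A small sketch of that warning, assuming a toy frame whose rows are labeled with candidate names (the frame and variable names are mine). Positional access with `.iloc`, covered later in this notebook, still works on such a frame.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Party": ["Republican", "Democratic"], "Year": [1980, 1980]},
    index=["Reagan", "Carter"],   # row labels are now strings
)

by_label = toy.loc["Carter"]      # works: "Carter" is a row label

try:
    toy.loc[0]                    # fails: no row is labeled 0
    numeric_label_failed = False
except KeyError:
    numeric_label_failed = True

by_position = toy.iloc[0]         # position 0 is still the first row

print(numeric_label_failed, by_label["Party"], by_position["Party"])
```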


Loc also supports slicing (for all types, including numeric and string labels!). Note that the slicing for loc is **inclusive**, even for numeric slices.

In [None]:
elections.loc[0:4, 'Candidate':'Year']

If we omit the column argument altogether, the default behavior is to retrieve all columns. 

In [None]:
elections.loc[[2, 4, 5]]

Loc also supports boolean array inputs instead of labels. The Boolean arrays _must_ be of the same length as the row/column shape of the dataframe, respectively (in versions prior to 0.25, Pandas used to allow size mismatches and would assume the missing values were all False, [this was changed in 2019](https://github.com/pandas-dev/pandas/pull/26911)).

In [None]:
elections.loc[[True, False, False, True, False, False, True, True, True, False, False, True, 
               True, True, False, True, True, False, False, False, True, False, False,
               False, False], # row mask: one entry per row (25 rows)
              [True, False, False, True, True] # column mask
             ]

In [None]:
elections.loc[[0, 3], ['Candidate', 'Year']]

We can use boolean array arguments for one axis of the data, and labels for the other.

In [None]:
elections.loc[[True, False, False, True, False, False, True, True, True, False, False, True, 
               True, True, False, True, True, False, False, False, True, False, False,
               False, False], # row mask: one entry per row (25 rows)
              
              'Candidate':'%' # column label slice
             ]

What do you think happens if you give single-value arguments for the requested rows AND columns?

In [78]:
elections.loc[15, '%']

48.3

## Positional access with `iloc`

.loc[]'s cousin .iloc[] is very similar, but it accesses data by numerical position instead of label. For example, to access the top 3 rows and first 3 columns of a table, we can use `.iloc[0:3, 0:3]`. iloc slicing is **exclusive**, just like standard Python slicing of numerical values.

In [None]:
elections.head(5)

In [None]:
elections.iloc[:3, 2:]
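
The difference between inclusive `loc` slices and exclusive `iloc` slices is easy to check on a toy frame. The sketch below also sorts the frame to show that labels travel with their rows while positions do not. The frame and variable names are invented for this illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Mondale"],
    "Year": [1980, 1980, 1980, 1984],
})

label_slice = toy.loc[0:2]       # labels 0, 1 and 2 -> 3 rows
position_slice = toy.iloc[0:2]   # positions 0 and 1 -> 2 rows
print(len(label_slice), len(position_slice))

# After sorting, each row keeps its label but moves to a new position.
by_name = toy.sort_values("Candidate")
print(by_name.loc[0, "Candidate"])    # the row labeled 0
print(by_name.iloc[0]["Candidate"])   # whichever row is now first
```

On the sorted frame, `loc[0]` still finds Reagan, while `iloc[0]` returns Anderson, the first name alphabetically.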

We will use both loc and iloc in the course. Loc is generally preferred for a number of reasons, for example: 

1. It is harder to make mistakes since you have to literally write out what you want to get.
2. Code is easier to read, because the reader doesn't have to know e.g. what column #31 represents.
3. It is robust against permutations of the data, e.g. the social security administration switches the order of two columns.

However, iloc is sometimes more convenient. We'll provide examples of when iloc is the superior choice.

## Quick Challenge

Which of the following expressions return a DataFrame of the Candidate and Year for the first 3 candidates that won with more than 50% of the vote?

In [None]:
elections.head(10)

In [None]:
elections.iloc[[0, 3, 5], [0, 3]]

In [None]:
elections.loc[[0, 3, 5], "Candidate":"Year"]

In [None]:
elections.loc[elections["%"] > 50, ["Candidate", "Year"]].head(3)

In [None]:
elections.loc[elections["%"] > 50, ["Candidate", "Year"]].iloc[0:2, :]

## Sampling

Pandas dataframes also make it easy to get a sample. We simply use the `sample` method and provide the number of samples that we'd like as the argument. Sampling is done without replacement by default. Set `replace=True` if you want replacement.

This is very useful when you have a big dataset and want to get an idea of what is in it without drowning in a huge output or only seeing the top.

In [None]:
elections.sample(10)

In [None]:
elections.query("Year < 1992").sample(50, replace=True)
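
Two details of `sample` are worth seeing in action: the `random_state` argument, which makes a draw repeatable, and what happens when you ask for more rows than exist. This is a sketch on a made-up five-row frame; `random_state` is not used in the original notebook.

```python
import pandas as pd

toy = pd.DataFrame({"Year": [1980, 1984, 1988, 1992, 1996]})

# random_state makes the draw repeatable between runs.
first = toy.sample(3, random_state=0)
second = toy.sample(3, random_state=0)

# Without replacement, asking for more rows than exist is an error.
try:
    toy.sample(10)
    too_many_failed = False
except ValueError:
    too_many_failed = True

# With replacement, rows may repeat, so any sample size is allowed.
with_repeats = toy.sample(10, replace=True, random_state=0)

print(first.equals(second), too_many_failed, len(with_repeats))
```

This is why the cell above needs `replace=True`: it asks for 50 rows from a table that has far fewer.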


## Handy Properties and Utility Functions for Series and DataFrames

#### Python Operations on Numerical DataFrames and Series

Consider a series of only the vote percentages of election winners.  I'm going to write two methods that do exactly the same thing below.  One uses multiple lines of code, one squashes them together. 

In [81]:
#First create a winners dataframe of just the winners
winners = elections.query("Result == 'win'")
#Next take just the "%" column from the winners dataframe.   
winners_percentage = winners["%"]
winners_percentage

0     50.7
3     58.8
5     53.4
7     43.0
10    49.2
14    47.9
16    50.7
17    52.9
19    51.1
22    46.1
23    51.3
Name: %, dtype: float64

In [79]:
#Does the same as above.   But doesn't create another variable "winners".  
#You'll often see people chaining multiple things together on a single line. 
winners_percentage = elections.query("Result == 'win'")["%"]
winners_percentage

0     50.7
3     58.8
5     53.4
7     43.0
10    49.2
14    47.9
16    50.7
17    52.9
19    51.1
22    46.1
23    51.3
Name: %, dtype: float64

We can apply various Python operations (including numpy operations) to DataFrames and Series.

In [None]:
max(winners_percentage)

In [None]:
np.mean(winners_percentage)

#### Handy Utility Methods

The `head` and `describe` methods and the `shape` and `size` attributes can be used to quickly get a good sense of the data we're working with. Remember when I said above we can use names for labeling rows?  This is a good dataset to demonstrate that.

In [None]:
mottos = pd.read_csv("mottos.csv", index_col="State")

In [None]:
mottos.head(20)

In [None]:
mottos.size

The fact that the size is 200 means our data file is relatively small, with only 200 total entries.

In [None]:
mottos.shape

Since we're looking at data for states, and we see the number 50, it looks like we've most likely got a complete dataset that omits Washington D.C. and U.S. territories like Guam and Puerto Rico.

In [None]:
mottos.describe()

Above, we see a quick summary of all the data. For example, the most common language for mottos is Latin, which covers 23 different states. Does anything else seem surprising?
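
To see how `size`, `shape`, and `describe` relate, here is a sketch on a three-row toy table with State as the index. The column name `Motto` and the three rows are my own stand-ins; the real `mottos.csv` has more rows and columns.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Motto": ["Eureka", "Excelsior", "Esto perpetua"],
        "Language": ["Greek", "Latin", "Latin"],
    },
    index=pd.Index(["California", "New York", "Idaho"], name="State"),
)

n_rows, n_cols = toy.shape   # the index is not counted as a column
print(toy.size)              # number of cells: rows times columns

# For text columns, describe reports count, unique, top (most common) and freq.
summary = toy.describe()
print(summary)
```

In the summary, `top` is the most common value in each column and `freq` is how many times it appears.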

We can get a direct reference to the index using .index.

In [None]:
mottos.index

We can also access individual properties of the index, for example, `mottos.index.name`.

In [None]:
mottos.index.name

This reflects the fact that in our data frame, the index IS the state name!

In [None]:
mottos.head(2)

It turns out the columns also have an Index. We can access this index by using `.columns`.

In [None]:
mottos.columns

There are also a ton of useful utility methods we can use with Data Frames and Series. For example, we can create a copy of a data frame sorted by a specific column using `sort_values`.

In [None]:
elections.sort_values('%')

Like most DataFrame methods, `sort_values` returns a new object and does **not** modify the original data structure, unless you set `inplace=True`.

In [None]:
elections.head(5)

If we want to sort in reverse order, we can set `ascending=False`.

In [None]:
elections.sort_values('%', ascending=False)
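
A quick way to convince yourself that sorting leaves the original alone is to compare the row order before and after. The sketch below uses a hand-typed three-row frame; the variable names are mine.

```python
import pandas as pd

toy = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson"],
    "%": [50.7, 41.0, 6.6],
})

ascending = toy.sort_values("%")                     # smallest first (the default)
descending = toy.sort_values("%", ascending=False)   # largest first

# toy itself is unchanged; assign the result if you want to keep the sorted order.
print(toy["Candidate"].tolist())
print(ascending["Candidate"].tolist())
print(descending["Candidate"].tolist())
```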

We can also use `sort_values` on Series objects.

In [None]:
mottos['Language'].sort_values(ascending=False).head(10)

For Series, the `value_counts` method is often quite handy.

In [None]:
elections['Party'].value_counts()

In [None]:
mottos['Language'].value_counts()

Also commonly used is the `unique` method, which returns all unique values as a numpy array.

In [None]:
mottos['Language'].unique()
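
Here is a small sketch of how `value_counts` and `unique` differ, using a made-up Series of languages rather than the real mottos file.

```python
import pandas as pd

languages = pd.Series(["Latin", "English", "Latin", "French", "Latin", "English"])

counts = languages.value_counts()   # how often each value occurs, most frequent first
distinct = languages.unique()       # distinct values, in order of first appearance

print(counts)
print(distinct)
```

`value_counts` returns a Series whose index holds the values and whose entries are the counts, while `unique` returns only the values themselves.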

## Baby Names Data

Now let's play around a bit with a large baby names dataset that is publicly available. We'll start by loading that dataset from the social security administration's website.

To keep the data small enough we're going to look at only California rather than looking at the national dataset.

In [83]:
import urllib.request
import os.path
import zipfile
import pandas as pd

data_url = "https://www.ssa.gov/oact/babynames/state/namesbystate.zip"
local_filename = "babynamesbystate.zip"
if not os.path.exists(local_filename): # if the data exists don't download again
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())

zf = zipfile.ZipFile(local_filename, 'r')

ca_name = 'CA.TXT'
field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
with zf.open(ca_name) as fh:
    babynames = pd.read_csv(fh, header=None, names=field_names)

babynames.sample(5)

Unnamed: 0,State,Sex,Year,Name,Count
412150,CA,M,2023,Gilberto,20
150815,CA,F,2000,Vania,13
136558,CA,F,1996,Melyssa,13
308830,CA,M,1984,Weston,40
348963,CA,M,2001,Myles,100


In [4]:
#Note that the babynames dataset includes both numeric and non-numeric information.  describe() defaults to just showing
# numbers for mixed datasets, include='all' will show summaries for all columns.  But look at this.  Just because it can
# doesn't mean an output is useful. 
babynames.describe(include='all')

Unnamed: 0,State,Sex,Year,Name,Count
count,407428,407428,407428.0,407428,407428.0
unique,1,2,,20437,
top,CA,F,,Jean,
freq,407428,239537,,223,
mean,,,1985.733609,,79.543456
std,,,27.00766,,293.698654
min,,,1910.0,,5.0
25%,,,1969.0,,7.0
50%,,,1992.0,,13.0
75%,,,2008.0,,38.0


# Exercises

Here is a list of questions to answer from the babynames dataset.

How many unique names exist in the dataset? 

What was the most popular male name?

What was the most popular female name?

What was the most popular female and male name in 2018? 

What were the top-10 names for the year you were born?

## AI analysis. 
I had AI generate answers for the following question:

What was the most popular name in any given year? (i.e. what year had the most people with the same name and what was that name?)

These answers are wrong.  Try and run the code and decide what it is doing and why it is wrong. 


## Harder goals:

How many different names were given to Males compared with Females? In 1960? in 2020?


What other questions can we ask?  Give me some questions!  


In [None]:
#How many unique names exist in the dataset? 

#replace the "..." and "function_name" below to produce the response. 
#You can also try other structure/methods to accomplish the goal if you wish. 
babynames.loc[...].function_name().function_name()



In [None]:
#What was the most popular male name?

male_babies = ...
male_babies.sort_values(...)...


In [None]:
#What was the most popular female and male name in 2018? 

#The following structure is intended to create 2 separate variables and print them.  
#You can use another structure and/or create separate cells to do the male/female separation. 

#Extract only females from 2018
femaleNames2018 = ...
#Extract only males from 2018
maleNames2018 = ...

#Extract most popular name from female dataset
mostPopularFemale2018 = femaleNames2018.sort_values(...)...
#Extract most popular name from male dataset
mostPopularMale2018 = maleNames2018.sort_values(...)...


print(mostPopularFemale2018)
print(mostPopularMale2018)

In [None]:
#What were the top-10 names for the year you were born?

#Now try this without any scaffolding prompts



AI attempts:


Here are AI attempts at the following question (Using ChatGPT 3.5). 

What was the most popular name in any given year? (i.e. what year had the most people with the same name and what was that name?)

I asked it three times and these are the responses I got. All of these outputs are wrong. Why? Some of them produce output, but not what is being asked for. Others raise errors and won't execute. Try to identify the reason for each bug.

Hint:  First understand exactly what output the question is asking for.  
Further hint:  There are subtle assumptions in the phrasing of this question. Data science often hits these issues. It's easy to ask a question, but when you actually try to quantify it there are often unstated issues that aren't clear.


In [6]:
#Attempt one. 
most_popular_name_year = babynames.groupby('Year')['Name'].agg(lambda x: x.value_counts().index[0])
print("Most popular name by year:\n", most_popular_name_year)

Index(['State', 'Sex', 'Year', 'Name', 'Count'], dtype='object')

In [None]:
#Attempt two.
most_popular_name_year = babynames.groupby(['Year', 'Name'])['Count'].sum().reset_index()
year_with_most_popular_name = most_popular_name_year.loc[most_popular_name_year.groupby('Year')['Count'].idxmax()]

print("Year with the most popular name in each year:")
print(year_with_most_popular_name)

In [None]:
#Attempt three. This one doesn't execute.
#First, why?  What line is causing the problem?
#What is the specific problem the error message is identifying?
#Next, what part of the line is causing the problem?
#There is a specific variable that is causing the problem. What is it, and what is its value? (Write some code to check its value.)
#Optional harder task: fix the problem.
year_with_most_popular_name = babynames.groupby(['Year', 'Name'])['Count'].sum().idxmax()
most_popular_name_year = babynames.loc[year_with_most_popular_name]['Name']
most_popular_name_count = babynames.loc[year_with_most_popular_name]['Count']

print("Year with the most popular name:", year_with_most_popular_name[0])
print("Most popular name in that year:", most_popular_name_year)
print("Total count for the most popular name:", most_popular_name_count)