# Downstream Exploitation of Space Data
## Python Crash Course Part 2: Data Analysis

### Learning Objectives

You will: 
* be able to read tabular data with pandas
* know basic data analysis techniques

### pandas

pandas is one of the most commonly used Python libraries for data analysis. Its documentation is available at https://pandas.pydata.org

Another popular library is polars, which is often faster and better suited to large datasets. We will not use it here.

Let's import pandas:

In [1]:
import pandas as pd # we can shorten the name of the library so that we don't have to type it in full every time

We will now read a .csv (comma separated value) file and save it into a variable:

In [2]:
data = pd.read_csv('iris.csv') # if the values in the file were separated by another symbol, we would also specify it
# e.g. pd.read_csv('iris.csv', sep=';') if the separator is a semicolon (;)
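
As a minimal sketch of the `sep` argument, the snippet below reads semicolon-separated text from an in-memory string instead of a file. The column names and values are made up for illustration.

```python
import io

import pandas as pd

# Made-up semicolon-separated content standing in for a file on disk
raw_text = "name;value\na;1\nb;2\n"

# sep=';' tells read_csv to split fields on semicolons instead of commas
demo = pd.read_csv(io.StringIO(raw_text), sep=';')
print(demo)
```

Without `sep=';'`, each line would most likely be read as a single column, which is a quick way to spot a wrong separator.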

Let's get some information about our dataset:

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal.length  150 non-null    float64
 1   sepal.width   150 non-null    float64
 2   petal.length  150 non-null    float64
 3   petal.width   150 non-null    float64
 4   variety       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [4]:
data.head() # printing the first n rows of the dataframe (5 by default)

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,5.1,3.5,1.4,0.2,Setosa
1,4.9,3.0,1.4,0.2,Setosa
2,4.7,3.2,1.3,0.2,Setosa
3,4.6,3.1,1.5,0.2,Setosa
4,5.0,3.6,1.4,0.2,Setosa


In [5]:
data.tail() # printing the last n rows of the dataframe (5 by default)

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
145,6.7,3.0,5.2,2.3,Virginica
146,6.3,2.5,5.0,1.9,Virginica
147,6.5,3.0,5.2,2.0,Virginica
148,6.2,3.4,5.4,2.3,Virginica
149,5.9,3.0,5.1,1.8,Virginica


In [6]:
data.shape

(150, 5)

In [7]:
data.columns

Index(['sepal.length', 'sepal.width', 'petal.length', 'petal.width',
       'variety'],
      dtype='object')

In [8]:
data['variety'].unique() # getting unique values in our only column with strings

array(['Setosa', 'Versicolor', 'Virginica'], dtype=object)

In [9]:
data.describe() # getting some descriptive statistics about our numerical columns

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width
count,150.0,150.0,150.0,150.0
mean,5.843333,3.057333,3.758,1.199333
std,0.828066,0.435866,1.765298,0.762238
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


### Selecting data

Often you will need only part of the data rather than all of it. We can select it as follows:

In [10]:
petals = data[['petal.length', 'petal.width']] # selecting multiple columns
petals.head()

Unnamed: 0,petal.length,petal.width
0,1.4,0.2
1,1.4,0.2
2,1.3,0.2
3,1.5,0.2
4,1.4,0.2


In [11]:
variety = data['variety'] # selecting a single column
variety.head()

0    Setosa
1    Setosa
2    Setosa
3    Setosa
4    Setosa
Name: variety, dtype: object

In [12]:
first_rows = data[0:5] # selecting the first 5 rows; the end position (5) is not included
first_rows

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,5.1,3.5,1.4,0.2,Setosa
1,4.9,3.0,1.4,0.2,Setosa
2,4.7,3.2,1.3,0.2,Setosa
3,4.6,3.1,1.5,0.2,Setosa
4,5.0,3.6,1.4,0.2,Setosa


In [13]:
first_rows1 = data.loc[0:5] # selecting the first 6 (!) rows, because .loc does label slicing, which is inclusive
first_rows1

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,5.1,3.5,1.4,0.2,Setosa
1,4.9,3.0,1.4,0.2,Setosa
2,4.7,3.2,1.3,0.2,Setosa
3,4.6,3.1,1.5,0.2,Setosa
4,5.0,3.6,1.4,0.2,Setosa
5,5.4,3.9,1.7,0.4,Setosa
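
The difference between label-based and position-based slicing is easier to see when the index does not start at 0. The sketch below uses a made-up dataframe and `.iloc`, the position-based counterpart of `.loc`, which is not used elsewhere in this notebook.

```python
import pandas as pd

# Made-up frame whose index labels start at 2 rather than 0
demo = pd.DataFrame({'x': [10, 20, 30, 40]}, index=[2, 3, 4, 5])

by_label = demo.loc[2:4]      # labels 2, 3 and 4: the end label is included
by_position = demo.iloc[0:2]  # positions 0 and 1: the end position is excluded

print(len(by_label), len(by_position))
```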


In [14]:
row_100 = data.loc[100] # get row 100
row_100

sepal.length          6.3
sepal.width           3.3
petal.length          6.0
petal.width           2.5
variety         Virginica
Name: 100, dtype: object

In [15]:
rows_variety = data.loc[[50,100], 'variety'] # get variety of rows 50 and 100
rows_variety

50     Versicolor
100     Virginica
Name: variety, dtype: object

### Adding data

You can add a new column to your dataset and populate it with 0s like this:

In [16]:
data['sepal_ratio'] = 0

In [17]:
data

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety,sepal_ratio
0,5.1,3.5,1.4,0.2,Setosa,0
1,4.9,3.0,1.4,0.2,Setosa,0
2,4.7,3.2,1.3,0.2,Setosa,0
3,4.6,3.1,1.5,0.2,Setosa,0
4,5.0,3.6,1.4,0.2,Setosa,0
...,...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Virginica,0
146,6.3,2.5,5.0,1.9,Virginica,0
147,6.5,3.0,5.2,2.0,Virginica,0
148,6.2,3.4,5.4,2.3,Virginica,0


To fill the column with computed values, e.g. the ratio of two other columns, we can assign them like this:

In [18]:
data['sepal_ratio'] = data['sepal.length'] / data['sepal.width']
data

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety,sepal_ratio
0,5.1,3.5,1.4,0.2,Setosa,1.457143
1,4.9,3.0,1.4,0.2,Setosa,1.633333
2,4.7,3.2,1.3,0.2,Setosa,1.468750
3,4.6,3.1,1.5,0.2,Setosa,1.483871
4,5.0,3.6,1.4,0.2,Setosa,1.388889
...,...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Virginica,2.233333
146,6.3,2.5,5.0,1.9,Virginica,2.520000
147,6.5,3.0,5.2,2.0,Virginica,2.166667
148,6.2,3.4,5.4,2.3,Virginica,1.823529


To add a new row, we build a one-row dataframe from a dictionary of column names and values, then concatenate it with our dataset:

In [19]:
new_row = pd.DataFrame({'sepal.length': [5.5], 'sepal.width': [3.0], 
                        'petal.length': [1.5], 'petal.width': [0.2],
                        'variety': ['Setosa'], 'sepal_ratio': [5.5/3.0]})
data = pd.concat([data, new_row], ignore_index=True)

In [20]:
data

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety,sepal_ratio
0,5.1,3.5,1.4,0.2,Setosa,1.457143
1,4.9,3.0,1.4,0.2,Setosa,1.633333
2,4.7,3.2,1.3,0.2,Setosa,1.468750
3,4.6,3.1,1.5,0.2,Setosa,1.483871
4,5.0,3.6,1.4,0.2,Setosa,1.388889
...,...,...,...,...,...,...
146,6.3,2.5,5.0,1.9,Virginica,2.520000
147,6.5,3.0,5.2,2.0,Virginica,2.166667
148,6.2,3.4,5.4,2.3,Virginica,1.823529
149,5.9,3.0,5.1,1.8,Virginica,1.966667


### Updating and deleting data

We have already updated data above; let's try some more:

In [21]:
data.loc[data['variety'] == 'Setosa', 'petal.length'] += 1 # this will add 1 to all values of petal.length for the Setosa variety
data

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety,sepal_ratio
0,5.1,3.5,2.4,0.2,Setosa,1.457143
1,4.9,3.0,2.4,0.2,Setosa,1.633333
2,4.7,3.2,2.3,0.2,Setosa,1.468750
3,4.6,3.1,2.5,0.2,Setosa,1.483871
4,5.0,3.6,2.4,0.2,Setosa,1.388889
...,...,...,...,...,...,...
146,6.3,2.5,5.0,1.9,Virginica,2.520000
147,6.5,3.0,5.2,2.0,Virginica,2.166667
148,6.2,3.4,5.4,2.3,Virginica,1.823529
149,5.9,3.0,5.1,1.8,Virginica,1.966667


If we don't want a column anymore, we can delete it:

In [22]:
data.drop('sepal_ratio', axis=1, inplace=True)
data.head()

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,5.1,3.5,2.4,0.2,Setosa
1,4.9,3.0,2.4,0.2,Setosa
2,4.7,3.2,2.3,0.2,Setosa
3,4.6,3.1,2.5,0.2,Setosa
4,5.0,3.6,2.4,0.2,Setosa


If we don't want some rows, we can delete them like this:

In [23]:
data.drop(data.index[0:5], inplace=True)
data.head()

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
5,5.4,3.9,2.7,0.4,Setosa
6,4.6,3.4,2.4,0.3,Setosa
7,5.0,3.4,2.5,0.2,Setosa
8,4.4,2.9,2.4,0.2,Setosa
9,4.9,3.1,2.5,0.1,Setosa
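
After rows are dropped, the remaining rows keep their original index labels, which is why the output above starts at 5. If a fresh 0-based index is preferred, `reset_index(drop=True)` can renumber the rows. A small sketch on made-up data:

```python
import pandas as pd

demo = pd.DataFrame({'x': [1, 2, 3, 4]})

trimmed = demo.drop(demo.index[0:2])         # remaining index labels are 2 and 3
renumbered = trimmed.reset_index(drop=True)  # index labels are 0 and 1 again

print(renumbered)
```

The notebook itself keeps the original labels, so the later outputs here are unaffected.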


### Data wrangling

Wrangling is the process of cleaning, organizing, and transforming raw data into a format that is easier to analyze and use. Data wrangling covers a lot of ground, so we will look only at some of the most commonly used operations.

Here is how you can filter rows based on values:

In [24]:
filtered_data = data[data['sepal.length'] > 5]
filtered_data

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
5,5.4,3.9,2.7,0.4,Setosa
10,5.4,3.7,2.5,0.2,Setosa
14,5.8,4.0,2.2,0.2,Setosa
15,5.7,4.4,2.5,0.4,Setosa
16,5.4,3.9,2.3,0.4,Setosa
...,...,...,...,...,...
146,6.3,2.5,5.0,1.9,Virginica
147,6.5,3.0,5.2,2.0,Virginica
148,6.2,3.4,5.4,2.3,Virginica
149,5.9,3.0,5.1,1.8,Virginica
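
Several conditions can be combined in one filter. The sketch below uses a made-up dataframe (the column name `length` is illustrative) and shows `&` for "and" and `|` for "or". Each condition needs its own parentheses.

```python
import pandas as pd

demo = pd.DataFrame({
    'length': [4.9, 5.4, 6.3, 5.8],
    'variety': ['Setosa', 'Setosa', 'Virginica', 'Versicolor'],
})

# Rows that satisfy both conditions
long_setosa = demo[(demo['length'] > 5) & (demo['variety'] == 'Setosa')]

# Rows that satisfy at least one condition
short_or_virginica = demo[(demo['length'] < 5) | (demo['variety'] == 'Virginica')]

print(long_setosa)
print(short_or_virginica)
```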


Renaming columns for easier use:

In [25]:
data.rename(columns={'sepal.length': 'sep.len', 
                     'sepal.width': 'sep.wid',
                     'petal.length': 'pet.len',
                     'petal.width': 'pet.wid'},
           inplace=True)
data

Unnamed: 0,sep.len,sep.wid,pet.len,pet.wid,variety
5,5.4,3.9,2.7,0.4,Setosa
6,4.6,3.4,2.4,0.3,Setosa
7,5.0,3.4,2.5,0.2,Setosa
8,4.4,2.9,2.4,0.2,Setosa
9,4.9,3.1,2.5,0.1,Setosa
...,...,...,...,...,...
146,6.3,2.5,5.0,1.9,Virginica
147,6.5,3.0,5.2,2.0,Virginica
148,6.2,3.4,5.4,2.3,Virginica
149,5.9,3.0,5.1,1.8,Virginica


One of the key steps in wrangling is handling missing values, which appear as None or NaN in your table. They can cause some methods to raise errors or return misleading results. There are more sophisticated techniques for dealing with them, but here we will simply remove them or fill them with the column mean:

In [26]:
data.loc[5:10, 'sep.len'] = None # first we have to introduce some missing values because this dataset does not have any
data

Unnamed: 0,sep.len,sep.wid,pet.len,pet.wid,variety
5,,3.9,2.7,0.4,Setosa
6,,3.4,2.4,0.3,Setosa
7,,3.4,2.5,0.2,Setosa
8,,2.9,2.4,0.2,Setosa
9,,3.1,2.5,0.1,Setosa
...,...,...,...,...,...
146,6.3,2.5,5.0,1.9,Virginica
147,6.5,3.0,5.2,2.0,Virginica
148,6.2,3.4,5.4,2.3,Virginica
149,5.9,3.0,5.1,1.8,Virginica


In [27]:
data['sep.len'] = data['sep.len'].fillna(data['sep.len'].mean()) # fill with the column mean; assigning the result back avoids the FutureWarning that fillna(..., inplace=True) raises on a selected column in recent pandas versions
data


Unnamed: 0,sep.len,sep.wid,pet.len,pet.wid,variety
5,5.914286,3.9,2.7,0.4,Setosa
6,5.914286,3.4,2.4,0.3,Setosa
7,5.914286,3.4,2.5,0.2,Setosa
8,5.914286,2.9,2.4,0.2,Setosa
9,5.914286,3.1,2.5,0.1,Setosa
...,...,...,...,...,...
146,6.300000,2.5,5.0,1.9,Virginica
147,6.500000,3.0,5.2,2.0,Virginica
148,6.200000,3.4,5.4,2.3,Virginica
149,5.900000,3.0,5.1,1.8,Virginica
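
Before choosing how to fill gaps, it can help to count them. The sketch below, on made-up data, counts missing values per column and then fills one column with its median instead of its mean. The median is one possible alternative when a column contains outliers, not the method used in this notebook.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 100.0],
    'b': [1.0, 2.0, np.nan, np.nan],
})

missing_per_column = demo.isna().sum()  # number of NaN values in each column

# The median of 1, 3 and 100 is 3, so the outlier 100 does not distort the fill value
demo['a'] = demo['a'].fillna(demo['a'].median())

print(missing_per_column)
print(demo)
```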


In [28]:
data.loc[5:10, 'sep.len'] = None # reintroduce the missing values so that we can demonstrate dropna

In [29]:
data.dropna(inplace=True) # remove rows with NaN values
data

Unnamed: 0,sep.len,sep.wid,pet.len,pet.wid,variety
11,4.8,3.4,2.6,0.2,Setosa
12,4.8,3.0,2.4,0.1,Setosa
13,4.3,3.0,2.1,0.1,Setosa
14,5.8,4.0,2.2,0.2,Setosa
15,5.7,4.4,2.5,0.4,Setosa
...,...,...,...,...,...
146,6.3,2.5,5.0,1.9,Virginica
147,6.5,3.0,5.2,2.0,Virginica
148,6.2,3.4,5.4,2.3,Virginica
149,5.9,3.0,5.1,1.8,Virginica
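
By default `dropna` removes every row that has a missing value in any column. If only certain columns matter, the `subset` parameter limits the check to them. A small sketch on made-up data:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, 2.0, 3.0]})

any_missing_removed = demo.dropna()         # keeps only rows with no NaN at all
only_a_checked = demo.dropna(subset=['a'])  # keeps rows where 'a' is present

print(any_missing_removed)
print(only_a_checked)
```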


### Grouping and aggregation

It is sometimes useful to group data by a certain criterion, like so:

In [30]:
variety_mean = data.groupby('variety').mean()
print(variety_mean)

            sep.len  sep.wid  pet.len  pet.wid
variety                                       
Setosa        5.045    3.440    2.465   0.2525
Versicolor    5.936    2.770    4.260   1.3260
Virginica     6.588    2.974    5.552   2.0260


In [31]:
variety_stats = data.groupby('variety').agg(['mean', 'median', 'std']) # multiple statistics
print(variety_stats)

           sep.len                  sep.wid                  pet.len         \
              mean median       std    mean median       std    mean median   
variety                                                                       
Setosa       5.045   5.05  0.363000   3.440    3.4  0.397299   2.465   2.50   
Versicolor   5.936   5.90  0.516171   2.770    2.8  0.313798   4.260   4.35   
Virginica    6.588   6.50  0.635880   2.974    3.0  0.322497   5.552   5.55   

                     pet.wid                   
                 std    mean median       std  
variety                                        
Setosa      0.187494  0.2525    0.2  0.110911  
Versicolor  0.469911  1.3260    1.3  0.197753  
Virginica   0.551895  2.0260    2.0  0.274650  
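
Passing a list of statistics to `agg` produces two-level column headers, as in the output above. The sketch below shows one way to pick values out of such a result, using a made-up dataframe with illustrative column names.

```python
import pandas as pd

demo = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'x': [1.0, 3.0, 10.0, 20.0],
    'y': [2.0, 4.0, 6.0, 8.0],
})

stats = demo.groupby('group').agg(['mean', 'max'])

x_stats = stats['x']           # all statistics for column x
x_mean = stats[('x', 'mean')]  # one statistic, selected with a (column, statistic) tuple

print(x_stats)
print(x_mean)
```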


In [32]:
variety_custom = data.groupby('variety').agg( # custom aggregation
    sep_len_mean=('sep.len', 'mean'), # mean of the column
    pet_len_range=('pet.len', lambda x: x.max() - x.min()) # difference with a lambda expression (see Part 1)
)
print(variety_custom)

            sep_len_mean  pet_len_range
variety                                
Setosa             5.045            0.9
Versicolor         5.936            2.1
Virginica          6.588            2.4


### Saving the data

To save your dataframe in a .csv file, run:

In [33]:
data.to_csv('iris_upd.csv', index=True)
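
With `index=True` the index is written as the first column of the file. The sketch below does a round trip through an in-memory buffer (a stand-in for a real file) on made-up data, to show how the index comes back when the file is read again.

```python
import io

import pandas as pd

demo = pd.DataFrame({'x': [1, 2]}, index=[5, 6])

buffer = io.StringIO()  # in-memory stand-in for a file on disk
demo.to_csv(buffer, index=True)
csv_text = buffer.getvalue()

# Read back as-is: the saved index appears as an extra column named 'Unnamed: 0'
plain = pd.read_csv(io.StringIO(csv_text))

# Read back with index_col=0: the first column becomes the index again
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(plain.columns.tolist())
print(restored)
```

If the index carries no information, `index=False` in `to_csv` avoids the extra column altogether.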

### Exercises

Below are 3 exercises in order of increasing difficulty for the topics covered above:

#### 1.1. Exploratory data analysis

Load the toy_dataset into a pandas dataframe:

In [34]:
# load the data

Perform the following operations:

In [35]:
# display the first 5 rows

In [50]:
# check for missing values -> HINT: take a look at isnull() and sum() methods

In [37]:
# calculate basic summary statistics for numerical columns

#### 1.2. Wrangling and aggregation

Perform the following:

In [57]:
# filter the dataframe to include only rows where the City is New York City or Los Angeles -> HINT: look at isin() method

In [59]:
# rename the column City to Location

In [63]:
# group the data by Gender and Illness and calculate the number of individuals in each group -> HINT: look at size() method

In [41]:
# calculate the percentage of people who are Ill for each City

In [60]:
# identify the City with the highest average income

#### 1.3. Data analysis

Perform the following:

In [86]:
# add a new column called Income_level:
# -> if income is in the top 25% of earners in the dataset, label it high
# -> if income is below the top 25% but above the bottom 25% of earners, label it medium
# -> otherwise, label it low

# HINT: look at quantile() method

In [84]:
# for each City, calculate the average income per Gender

In [121]:
# identify the City where Female have more medium labels in the Income_level column than Male -> HINT: look at unstack() method

In [122]:
# save the final dataframe to a new csv file called toy_dataset_final.csv

### Solutions

Below are the solutions with comments to the exercises from above:

#### 1.1. Exploratory data analysis

In [48]:
# load the data
df = pd.read_csv('toy_dataset.csv') # it is very common to name your dataframe df if you are just working with one dataframe for a project

In [49]:
# display the first 5 rows
df.head() # 5 rows is default, so we do not have to specify the number in ()

Unnamed: 0,Number,City,Gender,Age,Income,Illness
0,1,Dallas,Male,41,40367.0,No
1,2,Dallas,Male,54,45084.0,No
2,3,Dallas,Male,42,52483.0,No
3,4,Dallas,Male,40,40941.0,No
4,5,Dallas,Male,46,50289.0,No


In [51]:
# check for missing values
missing_values = df.isnull().sum()
missing_values

Number     0
City       0
Gender     0
Age        0
Income     0
Illness    0
dtype: int64

In [52]:
# calculate basic summary statistics for numerical columns
summary_stats = df.describe()
summary_stats

Unnamed: 0,Number,Age,Income
count,150000.0,150000.0,150000.0
mean,75000.5,44.9502,91252.798273
std,43301.414527,11.572486,24989.500948
min,1.0,25.0,-654.0
25%,37500.75,35.0,80867.75
50%,75000.5,45.0,93655.0
75%,112500.25,55.0,104519.0
max,150000.0,65.0,177157.0


Note that the Number column is just the sequence number of each data point, so its statistics are not informative.

#### 1.2. Wrangling and aggregation

In [56]:
# filter the dataframe to include only rows where the city is New York City or Los Angeles
filtered_df = df[df['City'].isin(['New York City', 'Los Angeles'])]
filtered_df

Unnamed: 0,Number,City,Gender,Age,Income,Illness
19707,19708,New York City,Male,49,112226.0,No
19708,19709,New York City,Male,42,110534.0,No
19709,19710,New York City,Female,61,100665.0,No
19710,19711,New York City,Female,58,98147.0,Yes
19711,19712,New York City,Female,43,93100.0,No
...,...,...,...,...,...,...
102182,102183,Los Angeles,Female,36,78898.0,No
102183,102184,Los Angeles,Male,40,95164.0,No
102184,102185,Los Angeles,Male,55,101259.0,No
102185,102186,Los Angeles,Female,62,90846.0,No


In [61]:
# rename the column City to Location
df.rename(columns={'City': 'Location'}, inplace=True)
df

Unnamed: 0,Number,Location,Gender,Age,Income,Illness
0,1,Dallas,Male,41,40367.0,No
1,2,Dallas,Male,54,45084.0,No
2,3,Dallas,Male,42,52483.0,No
3,4,Dallas,Male,40,40941.0,No
4,5,Dallas,Male,46,50289.0,No
...,...,...,...,...,...,...
149995,149996,Austin,Male,48,93669.0,No
149996,149997,Austin,Male,25,96748.0,No
149997,149998,Austin,Male,26,111885.0,No
149998,149999,Austin,Male,25,111878.0,No


In [73]:
# group the data by Gender and Illness and calculate the number of individuals in each group
grouped_counts = df.groupby(['Gender', 'Illness']).size().reset_index(name='count')
grouped_counts

Unnamed: 0,Gender,Illness,count
0,Female,No,60869
1,Female,Yes,5331
2,Male,No,76992
3,Male,Yes,6808


In [71]:
# calculate the percentage of people who are Ill for each City
ill_counts = df[df['Illness'] == 'Yes'].groupby('Location').size() # first calculate the number of Ill people in each City
total_counts = df.groupby('Location').size() # then calculate the number of all people in a City 

ill_percentage = (ill_counts / total_counts * 100).reset_index(name='ill_percentage') # divide one by the other and multiply by 100 to get %
ill_percentage

Unnamed: 0,Location,ill_percentage
0,Austin,8.224862
1,Boston,8.264065
2,Dallas,8.184909
3,Los Angeles,7.981848
4,Mountain View,8.284689
5,New York City,7.992923
6,San Diego,8.072116
7,Washington D.C.,8.226601


Here and in the previous question, .reset_index(name='X') is purely cosmetic; you can check that the code runs without it too:

In [74]:
grouped_counts = df.groupby(['Gender', 'Illness']).size()
grouped_counts

Gender  Illness
Female  No         60869
        Yes         5331
Male    No         76992
        Yes         6808
dtype: int64

In [75]:
ill_percentage = (ill_counts / total_counts * 100)
ill_percentage

Location
Austin             8.224862
Boston             8.264065
Dallas             8.184909
Los Angeles        7.981848
Mountain View      8.284689
New York City      7.992923
San Diego          8.072116
Washington D.C.    8.226601
dtype: float64
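
The illness percentage per city can also be computed in one step, because the mean of a True/False column is the fraction of True values. This is an alternative sketch on made-up data, not the approach used in the solution above.

```python
import pandas as pd

demo = pd.DataFrame({
    'Location': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Illness': ['Yes', 'No', 'No', 'No', 'Yes', 'Yes'],
})

is_ill = demo['Illness'] == 'Yes'  # True/False for each row

# Group the boolean column by city; the mean of booleans is the share of True values
ill_share = is_ill.groupby(demo['Location']).mean() * 100

print(ill_share)
```

One difference from dividing two counts: a city with no ill people gets 0 here, whereas it would be missing from the filtered counts and end up as NaN after the division.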

In [78]:
# identify the City with the highest average income
highest_avg_income = df.groupby('Location')['Income'].mean().idxmax()
print(f'The city with the highest average income is: {highest_avg_income}')

The city with the highest average income is: Mountain View


We can verify that it is correct: 

In [81]:
avg_income = df.groupby('Location')['Income'].mean()
avg_income

Location
Austin              90277.513423
Boston              91554.571497
Dallas              45252.231187
Los Angeles         95264.155410
Mountain View      135078.415782
New York City       96857.131393
San Diego          100756.209178
Washington D.C.     70991.612808
Name: Income, dtype: float64

#### 1.3. Data analysis

In [88]:
# add a new column called Income_level:
# -> if income is in the top 25% of earners in the dataset, label it high
# -> if income is below the top 25% but above the bottom 25% of earners, label it medium
# -> otherwise, label it low

# we first calculate income quartiles
q75 = df['Income'].quantile(0.75) # incomes above this value belong to the top 25% of earners
q25 = df['Income'].quantile(0.25) # incomes at or below this value belong to the bottom 25% of earners

In [89]:
def categorize_income(income): # we define a function to assign values to the Income_level column
    if income > q75:
        return 'high'
    elif income > q25:
        return 'medium'
    else:
        return 'low'

In [90]:
df['Income_level'] = df['Income'].apply(categorize_income)

In [91]:
df

Unnamed: 0,Number,Location,Gender,Age,Income,Illness,Income_level
0,1,Dallas,Male,41,40367.0,No,low
1,2,Dallas,Male,54,45084.0,No,low
2,3,Dallas,Male,42,52483.0,No,low
3,4,Dallas,Male,40,40941.0,No,low
4,5,Dallas,Male,46,50289.0,No,low
...,...,...,...,...,...,...,...
149995,149996,Austin,Male,48,93669.0,No,medium
149996,149997,Austin,Male,25,96748.0,No,medium
149997,149998,Austin,Male,26,111885.0,No,high
149998,149999,Austin,Male,25,111878.0,No,high
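
The same labelling can be done without a custom function by using `pd.cut`, which assigns each value to a bin. This is an alternative sketch on made-up incomes, not the method used in the solution above. The bin edges are chosen so that the result matches `categorize_income`: values equal to a quartile fall into the lower bin.

```python
import numpy as np
import pandas as pd

incomes = pd.Series([10, 20, 30, 40, 50])  # made-up values
q25 = incomes.quantile(0.25)  # 20.0
q75 = incomes.quantile(0.75)  # 40.0

# Bins are (-inf, q25], (q25, q75] and (q75, inf); the right edge of each bin is included
levels = pd.cut(
    incomes,
    bins=[-np.inf, q25, q75, np.inf],
    labels=['low', 'medium', 'high'],
)

print(levels)
```

The result is a categorical column rather than plain strings, which also keeps the low/medium/high ordering.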


In [92]:
# for each City, calculate the average income per Gender
avg_income_by_gender = df.groupby(['Location', 'Gender'])['Income'].mean().reset_index()
avg_income_by_gender

Unnamed: 0,Location,Gender,Income
0,Austin,Female,84958.760631
1,Austin,Male,94424.246272
2,Boston,Female,85978.44253
3,Boston,Male,96071.649368
4,Dallas,Female,39602.533142
5,Dallas,Male,49722.384021
6,Los Angeles,Female,89732.511306
7,Los Angeles,Male,99681.616055
8,Mountain View,Female,129476.667463
9,Mountain View,Male,139504.523354


In [113]:
# identify the City where Female have more medium labels in the Income_level column than Male

mid_income = df[df['Income_level'] == 'medium'] # first we filter the dataframe to keep only rows with Income_level labelled medium
mid_income_counts = mid_income.groupby(['Location', 'Gender']).size().unstack() # just like above, we group by Location and Gender
mid_income_counts

Location,Female,Male
Austin,3432.0,5226.0
Boston,2455.0,3360.0
Dallas,,12.0
Los Angeles,10574.0,11724.0
Mountain View,40.0,2.0
New York City,16789.0,16993.0
San Diego,1537.0,1331.0
Washington D.C.,205.0,1323.0


In [118]:
mid_income_counts['female_over_male'] = mid_income_counts['Female'] - mid_income_counts['Male'] # get difference
# get the city with the highest difference
city_with_highest_diff = mid_income_counts[mid_income_counts['female_over_male'] > 0].sort_values('female_over_male', ascending=False).head(1)
print(f'City with the highest difference is: {city_with_highest_diff.index[0]}')

City with the highest difference is: San Diego
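
More than one city can have more Female than Male medium labels. The sketch below, on made-up data, lists every such city and uses `unstack(fill_value=0)` so that a group with no rows counts as 0 instead of NaN. It is a variation on the solution above, not a replacement for it.

```python
import pandas as pd

demo = pd.DataFrame({
    'Location': ['A', 'A', 'A', 'B', 'B', 'C'],
    'Gender': ['Female', 'Female', 'Male', 'Female', 'Male', 'Male'],
})

# Count rows per (Location, Gender) and spread Gender into columns; missing pairs become 0
counts = demo.groupby(['Location', 'Gender']).size().unstack(fill_value=0)

# Keep the locations where the Female count exceeds the Male count
female_majority = counts[counts['Female'] > counts['Male']].index.tolist()

print(counts)
print(female_majority)
```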


You can now save your dataframe:

In [119]:
df.to_csv('toy_dataset_final.csv', index=True) # file name as requested in the exercise