# Combine many datasets

I'm looking for the best Metropolitan area in the U.S. to work as a belly dancer. 
I'd like to either be employed or open a studio.
The problem is that the potential income of a belly dancer in each area cannot be predicted directly, because such data is scarce or absent. 
I searched for as many relevant datasets as I could, and here's what I found.

## Available data

### U.S. employee wage and population, per occupation/area/year

- Source
    - Occupational Employment and Wage Statistics provided by U.S. Bureau of Labor Statistics
    - Website: https://www.bls.gov/oes/tables.htm
- Included data
    - Hourly and annual wage
    - Number of employees
    - Various occupations including dancer and other similar jobs
    - Statistics for the entire U.S. and for each Metropolitan area
    - Year coverage: from before 2006 to 2021
- Why did I collect this dataset?
    - The most segmented and thorough dataset of employment
    - It has **per area** statistics of **money**.

### U.S. population dataset, per area/year
- Source
    - U.S. government census
    - Website 
        - https://www.census.gov/data/tables/time-series/demo/popest/2010s-total-metro-and-micro-statistical-areas.html
        - https://data.census.gov
- Included data
    - Population data per Metropolitan area and year
    - US entire population demographic data
- Why did I collect this dataset?
    - It could be used to estimate the volume of demand (price x population)

### U.S. dance studio market size, per year
- Source
    - Statista
    - Website: https://www.statista.com/statistics/1175824/dance-studio-industry-market-size-us/
- Remarks
    - 3-4 billion dollars
- Why did I collect this dataset?
    - Dance studio revenue is what I'd like to know. What I eventually need is this data for each area, but the overall U.S. statistics will still be useful for seeing the trend over time.

### U.S. fitness revenue, per year
- Source1
    - FRED
    - Website: https://fred.stlouisfed.org/series/REVEF71394ALLEST
    - Revenue of fitness and recreational sports centers    
- Source2
    - Statista
    - Website: https://www.statista.com/statistics/605223/us-fitness-health-club-market-size-2007-2021/
    - Revenue of fitness, health, and gym club industries
- Remarks
    - 10-35 billion dollars
    - The two numbers differ, but the trends seem similar. However, they should be adjusted for inflation.
- Why did I collect this dataset?
    - I guess the fitness business shares its customer demographic with dance studios.
    - If so, it can serve as a cross-check on the market trend seen in the dance studio market size data.
    
### U.S. performing art company revenue, per year
- Source
    - FRED
    - Website: https://fred.stlouisfed.org/series/REV7111AMSA
    - Revenue of performing arts companies
- Remarks
    - 2.5-4.4 billion dollars
- Why did I collect this dataset?
    - Part of a dance studio's income comes from performances. This data can be a good reference for checking the trend in performing arts demand.


### U.S. dance studio owner's income in 2021
- Source
    - Blog (citing ZipRecruiter)
    - Website: https://www.thestudiodirector.com/blog/dance-studio-industry-stats/
- Remarks
    - There are 54,627 dance studios in the U.S.
    - Owner's average annual income: 43,048 dollars
    - **Competitors** of dance studios: **health clubs** that **offer dance classes** along with fitness programs
- Quick math
    - Total profit: 2.35 billion dollars (owner's income x number of studios)
    

### U.S. performing arts statistics in 2011

- Source
    - Government data
    - Website: https://www.arts.gov/sites/default/files/102.pdf

- Remarks
    - 8,840 organizations
    - 127,648 paid workers
    - 13.6 billion dollars revenue

- Why did I collect this dataset?
    - To cross-check with the FRED data above.
    - The 13.6 billion dollars of revenue in this government report differs a lot from the 2.5-4.4 billion dollars in the FRED data.
    - A possible reason is that this report includes donations and the income of non-profit companies. 

### U.S. Consumer Price Index (CPI) data, per area/year
- Source
    - U.S. Bureau of Labor Statistics
    - Website (example): https://www.bls.gov/regions/new-york-new-jersey/data/xg-tables/ro2xgcpiny1967.htm
- Remarks
    - Living in a big city costs 3 times the U.S. average...
- Why did I collect this dataset?
    - To correct for inflation and for different price levels between areas
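
A minimal sketch of how such a correction could work, assuming the usual CPI deflation (real value = nominal value x base CPI / CPI of that year). The function name `to_real_dollars` and the numbers are hypothetical illustrations, not values from the BLS tables.

```python
def to_real_dollars(nominal, cpi_year, cpi_base):
    """Express a nominal amount in base-period dollars using CPI values."""
    return nominal * cpi_base / cpi_year

# Hypothetical example: a 50,000 dollar wage earned when the CPI was 200,
# expressed in dollars of a base period where the CPI is 250.
real_wage = to_real_dollars(50_000, cpi_year=200.0, cpi_base=250.0)
print(real_wage)
```

The same function could compare areas as well as years, as long as the two CPI values share the same index base.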

## What to do with this data?

The goal is to pick the Metropolitan area where I can expect the most profit as a dancer, 
i.e. I need **revenue per area** data.
However, the datasets that include per-area information don't have revenue data,
and the revenue data is not available per area.

The best thing I can do is to make a hypothesis that connects those two kinds of data.
Before doing that, let's do some quick math to check a few things.

### Dance studio profit as a percentage of revenue

- Numbers
    - Total revenue in the U.S. (from Statista): 3.72 billion dollars in 2021 
    - Total profit (from a blog): 43k (owner's annual income) * 54.6k (the number of studios) = 2.35 billion dollars
- **Average profit** in percent: Profit/Revenue = 2.35/3.72 = **63\%** in 2021
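
The quick math above can be reproduced in a few lines. This is only a sketch using the three numbers quoted in this notebook; treating the owner's income as the studio's profit is the simplifying assumption made above, and the variable names are mine.

```python
owner_income = 43_048   # average annual owner income in dollars (blog)
n_studios = 54_627      # number of dance studios in the U.S. (blog)
revenue = 3.72e9        # dance studio market size in 2021 in dollars (Statista)

total_profit = owner_income * n_studios
profit_share = total_profit / revenue

print(f"total profit: {total_profit / 1e9:.2f} billion dollars")
print(f"profit share: {profit_share:.0%}")
```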


## To do in this notebook session
Eventually, I'll build a model after exploring the datasets described above.
To do so, I should first build a clean dataset table by merging the various data.
I have already prepared dozens of data files under the "data" directory:

### Data files

- {year}.xls or {year}.xlsx: employment dataset of each year, per Metropolitan area
- {year}nat.xls or {year}nat.xlsx: employment dataset of each year, entire U.S.
- rev.csv: hand-typed revenue data of all businesses, with values from the websites above
- C{year}.csv: total population demographic dataset downloaded from the government census website
- census.csv: Metropolitan population dataset downloaded from the government census website, after cutting unnecessary columns
- cpi.csv: CPI data per area and year


## Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.options.display.max_colwidth = 200

## Employment statistics data

In [2]:
# Read one Excel file to see its structure
sample = pd.read_excel('data/2010.xls')

sample.info()
display(sample.sample(3))


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7940 entries, 0 to 7939
Data columns (total 25 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   PRIM_STATE    7940 non-null   object
 1   AREA          7940 non-null   int64 
 2   AREA_NAME     7940 non-null   object
 3   OCC_CODE      7940 non-null   object
 4   OCC_TITLE     7940 non-null   object
 5   GROUP         253 non-null    object
 6   TOT_EMP       7940 non-null   object
 7   EMP_PRSE      7940 non-null   object
 8   JOBS_1000     7940 non-null   object
 9   LOC QUOTIENT  7940 non-null   object
 10  H_MEAN        7940 non-null   object
 11  A_MEAN        7940 non-null   object
 12  MEAN_PRSE     7940 non-null   object
 13  H_PCT10       7940 non-null   object
 14  H_PCT25       7940 non-null   object
 15  H_MEDIAN      7940 non-null   object
 16  H_PCT75       7940 non-null   object
 17  H_PCT90       7940 non-null   object
 18  A_PCT10       7940 non-null   object
 19  A_PCT2

Unnamed: 0,PRIM_STATE,AREA,AREA_NAME,OCC_CODE,OCC_TITLE,GROUP,TOT_EMP,EMP_PRSE,JOBS_1000,LOC QUOTIENT,...,H_MEDIAN,H_PCT75,H_PCT90,A_PCT10,A_PCT25,A_MEDIAN,A_PCT75,A_PCT90,ANNUAL,HOURLY
1648,DC,47900,"Washington-Arlington-Alexandria, DC-VA-MD-WV",25-1032,"Engineering Teachers, Postsecondary",,**,**,**,**,...,*,*,*,37880,63080,80120,108380,145170,True,
6881,TX,19100,"Dallas-Fort Worth-Arlington, TX",37-3011,Landscaping and Groundskeeping Workers,,17940,4.7,6.333,0.971,...,10.43,13.09,15.97,16280,18080,21690,27220,33220,,
2790,FL,33100,"Miami-Fort Lauderdale-Pompano Beach, FL",51-6064,"Textile Winding, Twisting, and Drawing Out Machine Setters, Operators, and Tenders",,80,40.2,0.036,0.164,...,11.13,14.52,15.65,16790,18330,23160,30210,32550,,


In [3]:
# Metropolitan areas of interest
# The keys are area (CBSA) codes, not postal zip codes
# LA has two codes because its code changed over the years
zipcode_area = {31100:'LA',31080:'LA',41860:'SanFran',16980:'Chicago',35620:'NY',42660:'Seattle'}

# Jobs of interest
# dancer1: dancer, dancer2: choreographer,
# dancer: dancer1 plus dancer2
# fitness1: fitness trainer/instructor, fitness2: recreational worker,
# fitness: fitness1 plus fitness2

code_job = {'27-2031':'dancer1','27-2032':'dancer2',
            '39-9031':'fitness1','39-9032':'fitness2',
            '00-0000':'all'}

In [4]:
%%script false --no-raise-error
# This block combines multiple wage data files, then generate a single csv file.
# If you already have data/wage.csv, this block can be skipped. It takes time to run.

df_save = []
for year in range(2006,2022):
    print(year)
    metro = None # metropolitan statistics data
    national = None # national statistics data
    try:
        metro = pd.read_excel('data/'+str(year)+'.xls')
        national = pd.read_excel('data/'+str(year)+'nat.xls')
    except FileNotFoundError:
        # Fall back to .xlsx when the .xls files do not exist for this year
        metro = pd.read_excel('data/'+str(year)+'.xlsx')
        national = pd.read_excel('data/'+str(year)+'nat.xlsx')
 
    metro.columns = metro.columns.str.strip().str.lower()
    national.columns = national.columns.str.strip().str.lower()
    
    # unify feature names in all years
    metro.rename(columns={'area_title':'area_name'},inplace=True)
        
    # LA area code changed
    area_la = 31100
    if year>2014:
        area_la=31080

    # Select metropolitan areas of interest
    # .copy() avoids pandas' SettingWithCopyWarning when columns are assigned later
    metro = metro.loc[metro.area.isin([area_la, 41860, 16980, 35620, 42660])].copy()
    
    # Select occupations of interest (the keys of code_job)
    metro = metro.loc[metro.occ_code.isin(list(code_job))]
    national = national.loc[national.occ_code.isin(list(code_job))].copy()

    
    # Map area codes to short area names
    metro['area'] = metro['area'].map(zipcode_area)

    # To match columns with metropolitan dataframe
    national['area'] = 'All'
    national['area_name'] = 'U.S. all'
    
    # Keep only columns to use
    metro = metro[['area', 'area_name', 'occ_code', 'occ_title', 
       'tot_emp', 'emp_prse', 'h_mean', 'a_mean', 'mean_prse', 'h_pct10',
       'h_pct25', 'h_median', 'h_pct75', 'h_pct90', 'a_pct10', 'a_pct25',
       'a_median', 'a_pct75', 'a_pct90']]

    national = national[['area', 'area_name', 'occ_code', 'occ_title', 
   'tot_emp', 'emp_prse', 'h_mean', 'a_mean', 'mean_prse', 'h_pct10',
   'h_pct25', 'h_median', 'h_pct75', 'h_pct90', 'a_pct10', 'a_pct25',
   'a_median', 'a_pct75', 'a_pct90']]
    
    
    # Combine national data with metropolitan data
    metro = pd.concat([national,metro], ignore_index=True)

    # add year
    metro['year']=year
    
    # Add the short occupation name
    metro['occ'] = metro['occ_code'].map(code_job)

    # Cleaning
    metro.replace('**',np.nan,inplace=True)
    metro.replace('*',np.nan,inplace=True)
    
    # Append to a list to save
    df_save.append(metro)

# Save a csv file
df_save = pd.concat(df_save)
df_save.to_csv('data/wage.csv', index=False)
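
The filter-and-label step in the cell above can be tried on a small frame without the Excel files. This is a sketch on made-up rows: the frame `toy` and the shortened dictionaries `area_names` and `job_names` are stand-ins for the real tables and for `zipcode_area` and `code_job`.

```python
import pandas as pd

# Toy stand-in for one year's metropolitan table (rows are made up).
toy = pd.DataFrame({
    'area': [31080, 41860, 10180, 35620],
    'occ_code': ['27-2031', '39-9031', '27-2031', '11-1011'],
})

area_names = {31080: 'LA', 41860: 'SanFran', 35620: 'NY'}
job_names = {'27-2031': 'dancer1', '39-9031': 'fitness1'}

# Keep rows whose area and occupation are both of interest.
keep = toy['area'].isin(list(area_names)) & toy['occ_code'].isin(list(job_names))
selected = toy.loc[keep].copy()

# Translate the codes into short labels.
selected['area'] = selected['area'].map(area_names)
selected['occ'] = selected['occ_code'].map(job_names)
print(selected)
```

Only the first two rows survive: the third has an area outside the list, and the fourth has an occupation outside the list.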

In [5]:
# Check that the data is prepared as intended
# If no record is missing, 6 (areas) x 5 (jobs) x 16 (years) = 480 records are expected
df = pd.read_csv('data/wage.csv')

df.info()
df.sample(5)


# Confirm that city and occupation labels are correct
for x in zipcode_area.values():
    print(x,df[df.area==x].area_name.unique())
    
for x in code_job.values():
    print(x,df[df.occ==x].occ_title.unique())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 458 entries, 0 to 457
Data columns (total 21 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   area       458 non-null    object 
 1   area_name  458 non-null    object 
 2   occ_code   458 non-null    object 
 3   occ_title  458 non-null    object 
 4   tot_emp    426 non-null    float64
 5   emp_prse   426 non-null    float64
 6   h_mean     452 non-null    float64
 7   a_mean     369 non-null    float64
 8   mean_prse  452 non-null    float64
 9   h_pct10    452 non-null    float64
 10  h_pct25    452 non-null    float64
 11  h_median   452 non-null    float64
 12  h_pct75    452 non-null    float64
 13  h_pct90    452 non-null    float64
 14  a_pct10    369 non-null    float64
 15  a_pct25    369 non-null    float64
 16  a_median   369 non-null    float64
 17  a_pct75    369 non-null    float64
 18  a_pct90    369 non-null    float64
 19  year       458 non-null    int64  
 20  occ       

- Area and job names are correctly marked.
- 458 rows, fewer than the expected 480 records. Let's take a look. 

In [6]:
# Drop columns that duplicate the short area and occ labels
df.drop(['area_name','occ_code','occ_title'],axis=1,inplace=True)


# Which records are missing?
for i in range(2006,2022):
    for j in df.area.unique():
        if df.loc[(df.area==j)&(df.year==i)].occ.nunique()!=5:
            print(i,j, set(df.occ.unique()) - set(df.loc[(df.area==j)&(df.year==i)].occ.unique())) 

2006 NY {'dancer1'}
2009 Seattle {'dancer2'}
2010 Seattle {'dancer1'}
2011 Seattle {'dancer1', 'dancer2'}
2012 Seattle {'dancer1', 'dancer2'}
2013 Seattle {'dancer1'}
2014 Seattle {'dancer1'}
2015 Seattle {'dancer1'}
2016 Seattle {'dancer1'}
2018 Seattle {'dancer2'}
2019 SanFran {'dancer2'}
2019 NY {'dancer2'}
2019 Seattle {'dancer2'}
2020 SanFran {'dancer2'}
2020 Chicago {'dancer2'}
2020 NY {'dancer2'}
2020 Seattle {'dancer2'}
2021 Chicago {'dancer1', 'dancer2'}
2021 Seattle {'dancer1'}


Some records of dancers' income are missing in multiple cities and years. 
Of course, dancers are hard to track!
Let's not handle the missing data for now.
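
As an alternative to the nested loop, the missing records can also be listed by comparing the records that exist with every expected (year, area, occ) combination. A sketch on made-up rows; the frame `toy` is hypothetical and much smaller than `df`.

```python
import pandas as pd

# Toy records (made up): 'dancer2' is absent for Seattle in 2011.
toy = pd.DataFrame({
    'year': [2010, 2010, 2011],
    'area': ['Seattle', 'Seattle', 'Seattle'],
    'occ':  ['dancer1', 'dancer2', 'dancer1'],
})

# Every combination we would expect if nothing were missing.
expected = pd.MultiIndex.from_product(
    [toy['year'].unique(), toy['area'].unique(), toy['occ'].unique()],
    names=['year', 'area', 'occ'],
)
present = pd.MultiIndex.from_frame(toy[['year', 'area', 'occ']])

# Combinations that are expected but not present.
missing = expected.difference(present)
print(list(missing))
```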

## Census data

### Combine U.S. demographic data of multiple years
Downloaded each file from: https://data.census.gov/cedsci/table?q=Age%20and%20Sex&tid=ACSDP1Y2010.DP05

In [7]:
# Check one example file
demo = pd.read_csv('data/C2015.csv')

demo.info()
display(demo.head(10))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89 entries, 0 to 88
Data columns (total 5 columns):
 #   Column                                  Non-Null Count  Dtype 
---  ------                                  --------------  ----- 
 0   Label (Grouping)                        89 non-null     object
 1   United States!!Estimate                 84 non-null     object
 2   United States!!Margin of Error          84 non-null     object
 3   United States!!Percent                  84 non-null     object
 4   United States!!Percent Margin of Error  84 non-null     object
dtypes: object(5)
memory usage: 3.6+ KB


Unnamed: 0,Label (Grouping),United States!!Estimate,United States!!Margin of Error,United States!!Percent,United States!!Percent Margin of Error
0,SEX AND AGE,,,,
1,Total population,321418821.0,*****,321418821,(X)
2,Male,158167834.0,"±31,499",49.2%,±0.1
3,Female,163250987.0,"±31,500",50.8%,±0.1
4,Under 5 years,19793807.0,"±16,520",6.2%,±0.1
5,5 to 9 years,20582473.0,"±62,124",6.4%,±0.1
6,10 to 14 years,20627389.0,"±58,029",6.4%,±0.1
7,15 to 19 years,21426912.0,"±30,132",6.7%,±0.1
8,20 to 24 years,22541077.0,"±30,660",7.0%,±0.1
9,25 to 34 years,43897832.0,"±33,513",13.7%,±0.1


In [8]:
%%script false --no-raise-error
# If you already have data/usDemo.csv, this block can be skipped.

# Combine multiple year files
df_save = []
for year in range(2010,2020):
    demo = pd.read_csv('data/C{0}.csv'.format(year))
    demo['year'] = year
    df_save.append(demo)
    
# Make one dataframe and save it as a csv file
df_save = pd.concat(df_save)

df_save.columns = ['label','estimate','estimate_err','pct','pct_err','year','estimate_err2']

display(df_save)

df_save.to_csv('data/usDemo.csv', index=False)
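
A note on how `pd.concat` treats the yearly files: it aligns frames by column name, so a column that is named differently in one year becomes an extra column filled with NaN for the other years. This may be why the rename above lists seven names, with the extra one after `year`, but that is an assumption since the yearly files are not all shown. A sketch with two made-up tables:

```python
import pandas as pd

# Two toy yearly tables (made up) whose margin-of-error column is named differently.
y2010 = pd.DataFrame({'label': ['Total population'], 'estimate': ['1,000'], 'moe': ['±10']})
y2011 = pd.DataFrame({'label': ['Total population'], 'estimate': ['1,100'], 'margin': ['±12']})

frames = []
for year, frame in [(2010, y2010), (2011, y2011)]:
    frame = frame.copy()
    frame['year'] = year
    frames.append(frame)

# Columns are matched by name; unmatched columns are kept and padded with NaN.
combined = pd.concat(frames, ignore_index=True)
print(combined.columns.tolist())
print(combined)
```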

### Clean Metropolitan census data

In [9]:
# Check metropolitan area census dataset
census_metro = pd.read_csv('data/census.csv')

print(census_metro.info())
display(census_metro.head(3))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2797 entries, 0 to 2796
Data columns (total 17 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   CBSA               2797 non-null   int64  
 1   MDIV               141 non-null    float64
 2   STCOU              1840 non-null   float64
 3   NAME               2797 non-null   object 
 4   LSAD               2797 non-null   object 
 5   CENSUS2010POP      2797 non-null   int64  
 6   ESTIMATESBASE2010  2797 non-null   int64  
 7   POPESTIMATE2010    2797 non-null   int64  
 8   POPESTIMATE2011    2797 non-null   int64  
 9   POPESTIMATE2012    2797 non-null   int64  
 10  POPESTIMATE2013    2797 non-null   int64  
 11  POPESTIMATE2014    2797 non-null   int64  
 12  POPESTIMATE2015    2797 non-null   int64  
 13  POPESTIMATE2016    2797 non-null   int64  
 14  POPESTIMATE2017    2797 non-null   int64  
 15  POPESTIMATE2018    2797 non-null   int64  
 16  POPESTIMATE2019    2797 

Unnamed: 0,CBSA,MDIV,STCOU,NAME,LSAD,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,POPESTIMATE2013,POPESTIMATE2014,POPESTIMATE2015,POPESTIMATE2016,POPESTIMATE2017,POPESTIMATE2018,POPESTIMATE2019
0,10180,,,"Abilene, TX",Metropolitan Statistical Area,165252,165252,165585,166634,167442,167473,168342,169688,170017,170429,171150,172060
1,10180,,48059.0,"Callahan County, TX",County or equivalent,13544,13545,13512,13511,13488,13502,13505,13589,13789,13968,13990,13943
2,10180,,48253.0,"Jones County, TX",County or equivalent,20202,20192,20238,20270,19870,20044,19850,19966,19971,19827,19866,20083


In [10]:
# Select only the areas of interest
lst=[]
for i in zipcode_area.keys():
    # Skip the old LA code (31100); the new one (31080) is used here
    if i==31100:
        continue
    lst.append(census_metro.loc[(census_metro.CBSA==i)&(census_metro.LSAD=='Metropolitan Statistical Area')])

census_metro = pd.concat(lst)

# Keep the POPESTIMATE2010-2019 columns and the CBSA code
census_metro = census_metro[list(census_metro.columns[7:17])+[census_metro.columns[0]]]

display(census_metro)

Unnamed: 0,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,POPESTIMATE2013,POPESTIMATE2014,POPESTIMATE2015,POPESTIMATE2016,POPESTIMATE2017,POPESTIMATE2018,POPESTIMATE2019,CBSA
850,12838417,12925753,13013443,13097434,13166609,13234696,13270694,13278000,13249879,13214799,31080
1307,4343634,4395725,4455473,4519636,4584981,4647924,4688198,4712421,4726314,4731803,41860
291,9470634,9500870,9528090,9550194,9560430,9552554,9533662,9514113,9484158,9458539,16980
1017,18923407,19052774,19149689,19226449,19280929,19320968,19334778,19322607,19276644,19216182,35620
1337,3449241,3503891,3558829,3612347,3675160,3739654,3816355,3885579,3935179,3979845,42660


### Merge all population data (U.S. census, Metropolitan census) to Employment dataframe (df)

In [11]:
df['population'] = np.nan

# merge metropolitan census data
for year in range(2010,2020):
    
    for zipcode in census_metro.CBSA.unique():
                 
        df.loc[(df.area==zipcode_area[zipcode])&(df.year==year),['population']]=\
            census_metro[census_metro.CBSA==zipcode]['POPESTIMATE{0}'.format(year)].squeeze()
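
The same assignment can also be written without loops by reshaping the census table from wide to long and merging on area and year. A sketch, assuming a cut-down census table with two areas and two years (the population values are copied from the table above) and a made-up employment frame `emp`:

```python
import pandas as pd

# Two rows of the census table shown above, 2010 and 2011 only.
census_wide = pd.DataFrame({
    'CBSA': [31080, 42660],
    'POPESTIMATE2010': [12838417, 3449241],
    'POPESTIMATE2011': [12925753, 3503891],
})
code_to_area = {31080: 'LA', 42660: 'Seattle'}

# Wide -> long: one row per (CBSA, year).
census_long = census_wide.melt(id_vars='CBSA', var_name='year', value_name='population')
census_long['year'] = (
    census_long['year'].str.replace('POPESTIMATE', '', regex=False).astype(int)
)
census_long['area'] = census_long['CBSA'].map(code_to_area)

# Toy employment rows (made up) to attach the population to.
emp = pd.DataFrame({'area': ['LA', 'Seattle', 'LA'], 'year': [2010, 2011, 2009]})
merged = emp.merge(
    census_long[['area', 'year', 'population']], on=['area', 'year'], how='left'
)
print(merged)
```

With `how='left'`, rows without a census estimate (2009 here) keep a NaN population, which matches the NaN default in the loop version.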

In [12]:
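# merge U.S. entire population data
demo = pd.read_csv('data/usDemo.csv')

# make year-population dictionary
year_pop = {}
for year in range(2010,2020):
    total = demo[(demo.label.str.contains('Total population'))&(demo.year==year)].iloc[0].estimate
    # The estimate may carry thousands separators or a trailing '.0':
    # drop the commas and go through float before converting to int
    year_pop[year] = int(float(str(total).replace(',','')))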

for year in range(2010,2020):
    df.loc[(df.area=='All')&(df.year==year),['population']] = year_pop[year]

## Revenue data

In [13]:
rev = pd.read_csv('data/rev.csv') # Manually organized yearly revenue data

# fitness: Statista fitness data
# fitness_fred: FRED fitness data
# perform_fred: FRED performing art data
# dance_studio: Statista dance studio data

rev['year']=pd.to_numeric(rev.DATE.str[:4])
# Convert everything to billion dollars (the FRED series are in million dollars)
rev['fitness_fred'] = rev['fitness_fred']/1000.
rev['perform_fred'] = rev['perform_fred']/1000.
rev.drop(['DATE'],inplace=True,axis=1)

display(rev)

Unnamed: 0,fitness_fred,perform_fred,dance_studio,fitness,year
0,10.797,,,,1998
1,11.777,,,,1999
2,12.543,,,,2000
3,13.542,,,,2001
4,14.987,,,,2002
5,16.287,,,,2003
6,17.174,,,,2004
7,18.286,,,,2005
8,19.447,,,,2006
9,21.416,,,,2007


### Merge revenue data into df

In [14]:
df['dance_studio'] = np.nan
df['fitness'] = np.nan
df['fitness_fred'] = np.nan
df['perform_fred'] = np.nan


for year in range(2006, 2022):
    
    df.loc[df.year==year,['dance_studio']] = rev[rev.year==year]['dance_studio'].squeeze()
    df.loc[df.year==year,['fitness']] = rev[rev.year==year]['fitness'].squeeze()
    df.loc[df.year==year,['fitness_fred']] = rev[rev.year==year]['fitness_fred'].squeeze()
    df.loc[df.year==year,['perform_fred']] = rev[rev.year==year]['perform_fred'].squeeze()
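
Since the revenue values depend only on the year, a left merge on `year` is another way to attach them. A sketch, assuming a two-row revenue table (the values are the 2012 and 2014 figures that appear in this notebook) and a made-up employment frame `emp`:

```python
import pandas as pd

# Revenue rows in billion dollars, taken from the tables in this notebook.
rev_toy = pd.DataFrame({
    'year': [2012, 2014],
    'dance_studio': [3.22, 3.42],
    'fitness': [29.65, 31.76],
})

# Toy employment rows (made up); 2013 has no revenue row on purpose.
emp = pd.DataFrame({'area': ['NY', 'NY', 'NY'], 'year': [2012, 2013, 2014]})

# Every employment row is kept; years without revenue get NaN.
merged = emp.merge(rev_toy, on='year', how='left')
print(merged)
```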

## CPI data

In [15]:
cpi = pd.read_csv('data/cpi.csv')

df['cpi'] = np.nan

# merge CPI data
for year in range(2006,2022):    
    
    for area in cpi.columns[1:]:
        
        df.loc[(df.area==area)&(df.year==year),['cpi']]= \
            cpi[cpi.year==year][area].squeeze()

In [16]:
display(df.sample(10))

df.to_csv('data/dance.csv', index=False)



Unnamed: 0,area,tot_emp,emp_prse,h_mean,a_mean,mean_prse,h_pct10,h_pct25,h_median,h_pct75,...,a_pct75,a_pct90,year,occ,population,dance_studio,fitness,fitness_fred,perform_fred,cpi
51,NY,,,31.36,65230.0,16.8,13.66,20.63,25.68,30.93,...,64330.0,142650.0,2007,dancer2,,,,21.416,,226.940083
105,Chicago,330.0,24.4,15.48,,11.4,9.68,10.45,12.09,18.83,...,,,2009,dancer1,,,,21.842,13.888,209.995083
414,LA,5822510.0,0.3,30.61,63660.0,0.7,13.24,14.94,21.9,37.76,...,78540.0,125220.0,2020,all,,3.43,32.52,24.361,10.147,278.567
383,Chicago,4676440.0,0.4,27.48,57160.0,0.8,10.33,13.36,20.43,34.46,...,71680.0,107140.0,2019,all,9458539.0,4.2,37.46,35.889,17.522,241.180667
245,SanFran,5030.0,7.8,27.86,57960.0,4.4,10.78,15.22,27.23,36.59,...,76100.0,103450.0,2014,fitness1,4584981.0,3.42,31.76,27.001,14.061,252.2595
190,Chicago,4276280.0,0.4,23.62,49120.0,0.9,9.01,10.96,17.71,29.88,...,62150.0,92450.0,2012,all,9528090.0,3.22,29.65,24.051,13.921,222.00475
82,NY,19010.0,5.1,23.71,49310.0,3.1,8.68,12.46,22.36,31.58,...,65680.0,88670.0,2008,fitness1,,,,22.339,,235.782417
180,LA,5282160.0,0.5,25.0,51990.0,0.6,9.01,11.32,18.24,31.8,...,66150.0,101280.0,2012,all,13013443.0,3.22,29.65,24.051,13.921,236.648
424,SanFran,710.0,41.8,22.22,,7.6,15.43,16.37,17.95,20.89,...,,,2020,dancer1,,3.43,32.52,24.361,10.147,300.443667
197,NY,150.0,11.3,31.46,65450.0,7.3,15.56,20.72,30.48,44.86,...,93310.0,109470.0,2012,dancer2,19149689.0,3.22,29.65,24.051,13.921,252.588333


## All done!
Now let's explore our prepared dataset.