# Summarizing Data Basics

In this notebook, we'll use the [all-ages.csv](all-ages.csv) and [recent-grads.csv](recent-grads.csv) datasets for some light data analysis, mainly aggregation and summary statistics. The datasets are derived from the 2010-2012 American Community Survey.

---
#### Loading data

In [1]:
import pandas as pd

# Loading the datasets
grads = pd.read_csv("recent-grads.csv")
ages = pd.read_csv("all-ages.csv")

In [2]:
# Let's have a look at them

grads

| | Rank | Major_code | Major | Major_category | Total | Sample_size | Men | Women | ShareWomen | Employed | ... | Part_time | Full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th | College_jobs | Non_college_jobs | Low_wage_jobs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2419 | PETROLEUM ENGINEERING | Engineering | 2339 | 36 | 2057 | 282 | 0.120564 | 1976 | ... | 270 | 1207 | 37 | 0.018381 | 110000 | 95000 | 125000 | 1534 | 364 | 193 |
| 1 | 2 | 2416 | MINING AND MINERAL ENGINEERING | Engineering | 756 | 7 | 679 | 77 | 0.101852 | 640 | ... | 170 | 388 | 85 | 0.117241 | 75000 | 55000 | 90000 | 350 | 257 | 50 |
| 2 | 3 | 2415 | METALLURGICAL ENGINEERING | Engineering | 856 | 3 | 725 | 131 | 0.153037 | 648 | ... | 133 | 340 | 16 | 0.024096 | 73000 | 50000 | 105000 | 456 | 176 | 0 |
| 3 | 4 | 2417 | NAVAL ARCHITECTURE AND MARINE ENGINEERING | Engineering | 1258 | 16 | 1123 | 135 | 0.107313 | 758 | ... | 150 | 692 | 40 | 0.050125 | 70000 | 43000 | 80000 | 529 | 102 | 0 |
| 4 | 5 | 2405 | CHEMICAL ENGINEERING | Engineering | 32260 | 289 | 21239 | 11021 | 0.341631 | 25694 | ... | 5180 | 16697 | 1672 | 0.061098 | 65000 | 50000 | 75000 | 18314 | 4440 | 972 |
| 5 | 6 | 2418 | NUCLEAR ENGINEERING | Engineering | 2573 | 17 | 2200 | 373 | 0.144967 | 1857 | ... | 264 | 1449 | 400 | 0.177226 | 65000 | 50000 | 102000 | 1142 | 657 | 244 |
| 6 | 7 | 6202 | ACTUARIAL SCIENCE | Business | 3777 | 51 | 832 | 960 | 0.535714 | 2912 | ... | 296 | 2482 | 308 | 0.095652 | 62000 | 53000 | 72000 | 1768 | 314 | 259 |
| 7 | 8 | 5001 | ASTRONOMY AND ASTROPHYSICS | Physical Sciences | 1792 | 10 | 2110 | 1667 | 0.441356 | 1526 | ... | 553 | 827 | 33 | 0.021167 | 62000 | 31500 | 109000 | 972 | 500 | 220 |
| 8 | 9 | 2414 | MECHANICAL ENGINEERING | Engineering | 91227 | 1029 | 12953 | 2105 | 0.139793 | 76442 | ... | 13101 | 54639 | 4650 | 0.057342 | 60000 | 48000 | 70000 | 52844 | 16384 | 3253 |
| 9 | 10 | 2408 | ELECTRICAL ENGINEERING | Engineering | 81527 | 631 | 8407 | 6548 | 0.437847 | 61928 | ... | 12695 | 41413 | 3895 | 0.059174 | 60000 | 45000 | 72000 | 45829 | 10874 | 3170 |


In [3]:
ages

| | Major_code | Major | Major_category | Total | Employed | Employed_full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1100 | GENERAL AGRICULTURE | Agriculture & Natural Resources | 128148 | 90245 | 74078 | 2423 | 0.026147 | 50000 | 34000 | 80000.0 |
| 1 | 1101 | AGRICULTURE PRODUCTION AND MANAGEMENT | Agriculture & Natural Resources | 95326 | 76865 | 64240 | 2266 | 0.028636 | 54000 | 36000 | 80000.0 |
| 2 | 1102 | AGRICULTURAL ECONOMICS | Agriculture & Natural Resources | 33955 | 26321 | 22810 | 821 | 0.030248 | 63000 | 40000 | 98000.0 |
| 3 | 1103 | ANIMAL SCIENCES | Agriculture & Natural Resources | 103549 | 81177 | 64937 | 3619 | 0.042679 | 46000 | 30000 | 72000.0 |
| 4 | 1104 | FOOD SCIENCE | Agriculture & Natural Resources | 24280 | 17281 | 12722 | 894 | 0.049188 | 62000 | 38500 | 90000.0 |
| 5 | 1105 | PLANT SCIENCE AND AGRONOMY | Agriculture & Natural Resources | 79409 | 63043 | 51077 | 2070 | 0.031791 | 50000 | 35000 | 75000.0 |
| 6 | 1106 | SOIL SCIENCE | Agriculture & Natural Resources | 6586 | 4926 | 4042 | 264 | 0.050867 | 63000 | 39400 | 88000.0 |
| 7 | 1199 | MISCELLANEOUS AGRICULTURE | Agriculture & Natural Resources | 8549 | 6392 | 5074 | 261 | 0.039230 | 52000 | 35000 | 75000.0 |
| 8 | 1301 | ENVIRONMENTAL SCIENCE | Biology & Life Science | 106106 | 87602 | 65238 | 4736 | 0.051290 | 52000 | 38000 | 75000.0 |
| 9 | 1302 | FORESTRY | Agriculture & Natural Resources | 69447 | 48228 | 39613 | 2144 | 0.042563 | 58000 | 40500 | 80000.0 |


---
#### Summarize the total number of grads under each major category

In [4]:
# Get the unique values of a column (Series) in the DataFrame
unique_grads = grads["Major_category"].unique()

# We could filter on each of those values to get the sum for each category;
# this one is for "Engineering", the first category to appear
engineerings = grads[grads["Major_category"] == unique_grads[0]]["Total"].sum()

# Let's do this more efficiently and collect the sums in a dictionary
majors = dict()

for major in unique_grads:
    majors[major] = grads[grads["Major_category"] == major]["Total"].sum()

majors

{'Agriculture & Natural Resources': 79981,
 'Arts': 357130,
 'Biology & Life Science': 453862,
 'Business': 1302376,
 'Communications & Journalism': 392601,
 'Computers & Mathematics': 299008,
 'Education': 559129,
 'Engineering': 537583,
 'Health': 463230,
 'Humanities & Liberal Arts': 713468,
 'Industrial Arts & Consumer Services': 229792,
 'Interdisciplinary': 12296,
 'Law & Public Policy': 179107,
 'Physical Sciences': 185479,
 'Psychology & Social Work': 481007,
 'Social Science': 529966}
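
The same per-category totals can also be computed without an explicit loop. The following is a minimal sketch using pandas' `groupby`; the small `sample` DataFrame is an assumption, built from five rows of the `grads` table shown earlier so that it runs without the CSV file.

```python
import pandas as pd

# Stand-in for `grads`: five rows copied from the table shown earlier
# (an assumption, so the example runs without recent-grads.csv)
sample = pd.DataFrame({
    "Major_category": ["Engineering", "Engineering", "Engineering",
                       "Business", "Physical Sciences"],
    "Total": [2339, 756, 856, 3777, 1792],
})

# Group the rows by category and sum "Total" within each group
totals = sample.groupby("Major_category")["Total"].sum()

# Convert to a plain dict to match the shape of the loop-based result
totals_dict = totals.to_dict()
print(totals_dict)
```

On the full dataset, `grads.groupby("Major_category")["Total"].sum()` should give the same numbers as the dictionary above, returned as a Series indexed by category.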

---
#### Let's find the share of grads in low-wage jobs

In [5]:
low_wage = grads["Low_wage_jobs"].sum() / grads["Total"].sum()
low_wage

0.09852546076122913
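
Note that dividing the two column sums gives a rate weighted by major size, which is not the same as averaging the per-major rates. The sketch below illustrates the difference; `sample` is an assumption, built from five rows of the `grads` table shown earlier, so its numbers differ from the full-dataset result.

```python
import pandas as pd

# Stand-in for `grads`: values copied from five rows of the table shown earlier
sample = pd.DataFrame({
    "Total": [2339, 756, 856, 3777, 1792],
    "Low_wage_jobs": [193, 50, 0, 259, 220],
})

# Overall rate: every graduate counts equally, so large majors weigh more
overall_rate = sample["Low_wage_jobs"].sum() / sample["Total"].sum()

# Mean of per-major rates: every major counts equally, regardless of size
mean_of_rates = (sample["Low_wage_jobs"] / sample["Total"]).mean()

# Format both as percentages for easier reading
print(f"overall: {overall_rate:.2%}, mean of per-major rates: {mean_of_rates:.2%}")
```

Which of the two is appropriate depends on the question: the first describes graduates as a whole, the second describes a typical major.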

---
#### Let's compare the unemployment rate of recent grads with the all-ages dataset (the whole population)

In [6]:
majors = grads["Major"].unique()

# Count the majors where recent grads have a lower unemployment rate
# than the all-ages population
lower_count = 0

for m in majors:
    grads_rate = grads[grads["Major"] == m].iloc[0]["Unemployment_rate"]
    ages_rate = ages[ages["Major"] == m].iloc[0]["Unemployment_rate"]

    if grads_rate < ages_rate:
        lower_count += 1

lower_count

43
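
The per-major lookup can also be expressed as a join on the `Major` column followed by a vectorized comparison. This is a sketch only: `grads_toy` and `ages_toy`, along with their major names and rates, are hypothetical values made up for illustration and are not taken from the CSV files.

```python
import pandas as pd

# Hypothetical toy data (not taken from the CSV files)
grads_toy = pd.DataFrame({
    "Major": ["MAJOR A", "MAJOR B", "MAJOR C"],
    "Unemployment_rate": [0.03, 0.08, 0.05],
})
ages_toy = pd.DataFrame({
    "Major": ["MAJOR A", "MAJOR B", "MAJOR C"],
    "Unemployment_rate": [0.04, 0.06, 0.05],
})

# Align the two tables on "Major"; the suffixes tell the shared column apart
merged = grads_toy.merge(ages_toy, on="Major", suffixes=("_grads", "_ages"))

# Boolean Series: True where recent grads have the strictly lower rate
is_lower = merged["Unemployment_rate_grads"] < merged["Unemployment_rate_ages"]

# Summing booleans counts the True values
lower_count = int(is_lower.sum())
print(lower_count)
```

With the default inner join, majors that appear in only one of the two tables are dropped rather than raising an error, which differs from the `iloc[0]` lookup in the loop.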

---
#### Let's get stats for the ages DataFrame

In [7]:
ages.describe()

| | Major_code | Total | Employed | Employed_full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th |
|---|---|---|---|---|---|---|---|---|---|
| count | 173.0 | 173.0 | 173.0 | 173.0 | 173.0 | 173.0 | 173.0 | 173.0 | 173.0 |
| mean | 3879.815029 | 230256.6 | 166162.0 | 126307.8 | 9725.034682 | 0.057355 | 56816.184971 | 38697.109827 | 82506.358382 |
| std | 1687.75314 | 422068.5 | 307324.4 | 242425.4 | 18022.040192 | 0.019177 | 14706.226865 | 9414.524761 | 20805.330126 |
| min | 1100.0 | 2396.0 | 1492.0 | 1093.0 | 0.0 | 0.0 | 35000.0 | 24900.0 | 45800.0 |
| 25% | 2403.0 | 24280.0 | 17281.0 | 12722.0 | 1101.0 | 0.046261 | 46000.0 | 32000.0 | 70000.0 |
| 50% | 3608.0 | 75791.0 | 56564.0 | 39613.0 | 3619.0 | 0.054719 | 53000.0 | 36000.0 | 80000.0 |
| 75% | 5503.0 | 205763.0 | 142879.0 | 111025.0 | 8862.0 | 0.069043 | 65000.0 | 42000.0 | 95000.0 |
| max | 6403.0 | 3123510.0 | 2354398.0 | 1939384.0 | 147261.0 | 0.156147 | 125000.0 | 78000.0 | 210000.0 |
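
To make the rows of `describe()` easier to interpret, the sketch below checks a few of them against the corresponding individual methods. The `median_pay` Series is an assumption, built from the `Median` values of five rows of the `ages` table shown earlier.

```python
import pandas as pd

# "Median" values copied from five rows of the ages table shown earlier
median_pay = pd.Series([50000, 54000, 63000, 46000, 62000], name="Median")

# describe() returns a Series indexed by statistic name
stats = median_pay.describe()

# The "50%" row is the median, and "std" is the sample standard deviation
print(stats["mean"] == median_pay.mean())
print(stats["50%"] == median_pay.median())
print(stats["std"] == median_pay.std())
```

Individual statistics can be pulled out of the result by label, for example `ages.describe().loc["50%", "Median"]` on the full DataFrame.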


---
#### Sum of the "Total" column in the ages DataFrame

In [8]:
ages.Total.sum()

39834398

---
#### Get a random sample from the ages DataFrame

In [9]:
ages.sample(10)

| | Major_code | Major | Major_category | Total | Employed | Employed_full_time_year_round | Unemployed | Unemployment_rate | Median | P25th | P75th |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 164 | 6206 | MARKETING AND MARKETING RESEARCH | Business | 1114624 | 890125 | 704912 | 51839 | 0.055033 | 56000 | 38500 | 90000.0 |
| 76 | 3402 | HUMANITIES | Humanities & Liberal Arts | 46188 | 29971 | 19460 | 2530 | 0.077844 | 46700 | 30000 | 70000.0 |
| 93 | 3801 | MILITARY TECHNOLOGIES | Industrial Arts & Consumer Services | 4315 | 1650 | 1708 | 187 | 0.101796 | 64000 | 39750 | 90000.0 |
| 1 | 1101 | AGRICULTURE PRODUCTION AND MANAGEMENT | Agriculture & Natural Resources | 95326 | 76865 | 64240 | 2266 | 0.028636 | 54000 | 36000 | 80000.0 |
| 8 | 1301 | ENVIRONMENTAL SCIENCE | Biology & Life Science | 106106 | 87602 | 65238 | 4736 | 0.05129 | 52000 | 38000 | 75000.0 |
| 114 | 5200 | PSYCHOLOGY | Psychology & Social Work | 1484075 | 1055854 | 736817 | 79066 | 0.069667 | 45000 | 31000 | 68000.0 |
| 134 | 5599 | MISCELLANEOUS SOCIAL SCIENCES | Social Science | 15882 | 12307 | 9444 | 708 | 0.054399 | 52000 | 40000 | 80000.0 |
| 74 | 3302 | COMPOSITION AND RHETORIC | Humanities & Liberal Arts | 59211 | 44913 | 29628 | 3569 | 0.073615 | 40000 | 28800 | 65000.0 |
| 139 | 6001 | DRAMA AND THEATER ARTS | Arts | 174817 | 135071 | 81519 | 11789 | 0.080274 | 42000 | 29000 | 62000.0 |
| 43 | 2403 | ARCHITECTURAL ENGINEERING | Engineering | 19587 | 13713 | 11180 | 1017 | 0.069043 | 78000 | 50000 | 102000.0 |
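
Because `sample()` draws rows at random, each run of the cell returns a different selection. When a repeatable draw is wanted, a seed can be passed through the `random_state` parameter, as in this sketch; `sample_df` is an assumption, built from five rows of the `ages` table shown earlier, and the seed value 42 is arbitrary.

```python
import pandas as pd

# Stand-in for `ages`: majors and median pay copied from the table shown earlier
sample_df = pd.DataFrame({
    "Major": ["GENERAL AGRICULTURE", "AGRICULTURAL ECONOMICS",
              "ANIMAL SCIENCES", "FOOD SCIENCE", "SOIL SCIENCE"],
    "Median": [50000, 63000, 46000, 62000, 63000],
})

# The same seed yields the same rows in the same order on every call
first = sample_df.sample(n=3, random_state=42)
second = sample_df.sample(n=3, random_state=42)

print(first.equals(second))
```

The original row labels are kept in the sampled result, which is why the index in the output above is not sequential.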


---
---
### That's it!

Prepared by Issam Hijazi

https://ae.linkedin.com/in/ihijazi

@iHijazi