### Grouping and Aggregating



In [3]:
import pandas as pd
survey = pd.read_csv("Data/survey_results_public.csv")
survey.head()

,Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,...,WelcomeChange,SONewContent,Age,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
0,1,I am a student who is learning to code,Yes,Never,The quality of OSS and closed source software ...,"Not employed, and not looking for work",United Kingdom,No,Primary/elementary school,,...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,14.0,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
1,2,I am a student who is learning to code,No,Less than once per year,The quality of OSS and closed source software ...,"Not employed, but looking for work",Bosnia and Herzegovina,"Yes, full-time","Secondary school (e.g. American high school, G...",,...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,19.0,Man,No,Straight / Heterosexual,,No,Appropriate in length,Neither easy nor difficult
2,3,"I am not primarily a developer, but I write co...",Yes,Never,The quality of OSS and closed source software ...,Employed full-time,Thailand,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)",Web development or web design,...,Just as welcome now as I felt last year,Tech meetups or events in your area;Courses on...,28.0,Man,No,Straight / Heterosexual,,Yes,Appropriate in length,Neither easy nor difficult
3,4,I am a developer by profession,No,Never,The quality of OSS and closed source software ...,Employed full-time,United States,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",...,Just as welcome now as I felt last year,Tech articles written by other developers;Indu...,22.0,Man,No,Straight / Heterosexual,White or of European descent,No,Appropriate in length,Easy
4,5,I am a developer by profession,Yes,Once a month or more often,"OSS is, on average, of HIGHER quality than pro...",Employed full-time,Ukraine,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Computer science, computer engineering, or sof...",...,Just as welcome now as I felt last year,Tech meetups or events in your area;Courses on...,30.0,Man,No,Straight / Heterosexual,White or of European descent;Multiracial,No,Appropriate in length,Easy


### Count function

`count()` counts the non-NA cells in each column or row. The `shape` attribute, by contrast, returns the number of rows and columns, including rows that hold NA values.

In [15]:
print(survey.shape)
print(survey["SocialMedia"].count())

(88883, 85)
84437
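
The gap between the two numbers above (88883 - 84437 = 4446) is the number of missing `SocialMedia` answers. The same relationship can be checked on a small example; the `toy` DataFrame below is hypothetical data made up for illustration, not part of the survey file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the survey file
toy = pd.DataFrame({
    "Country": ["Turkey", "Turkey", "Ukraine", "Thailand"],
    "SocialMedia": ["WhatsApp", np.nan, "Reddit", np.nan],
})

n_rows, n_cols = toy.shape                    # every row counts, missing or not
n_answered = toy["SocialMedia"].count()       # only non-NA cells
n_missing = toy["SocialMedia"].isna().sum()   # NA cells only

print(n_rows, n_answered, n_missing)
```

Here `n_answered + n_missing` equals `n_rows`, which is a quick sanity check when a column has gaps.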


### Series.value_counts

* Return a Series containing counts of unique values.
* If `normalize=True`, the returned object contains the relative frequency of each unique value (fractions that sum to 1) instead of counts.

In [17]:
print(survey["SocialMedia"].value_counts())

Reddit                      14374
YouTube                     13830
WhatsApp                    13347
Facebook                    13178
Twitter                     11398
Instagram                    6261
I don't use social media     5554
LinkedIn                     4501
WeChat 微信                     667
Snapchat                      628
VK ВКонта́кте                 603
Weibo 新浪微博                     56
Youku Tudou 优酷                 21
Hello                          19
Name: SocialMedia, dtype: int64


In [18]:
print(survey["SocialMedia"].value_counts(normalize = True))

Reddit                      0.170233
YouTube                     0.163791
WhatsApp                    0.158071
Facebook                    0.156069
Twitter                     0.134988
Instagram                   0.074150
I don't use social media    0.065777
LinkedIn                    0.053306
WeChat 微信                   0.007899
Snapchat                    0.007437
VK ВКонта́кте               0.007141
Weibo 新浪微博                  0.000663
Youku Tudou 优酷              0.000249
Hello                       0.000225
Name: SocialMedia, dtype: float64
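
As a small sketch of how `normalize` behaves, the example below uses a hypothetical `answers` Series (made-up values, not survey data). It assumes the default `dropna=True`, so missing answers are excluded from the denominator unless `dropna=False` is passed, and it multiplies by 100 to turn fractions into percentages.

```python
import pandas as pd

# Hypothetical answers; None stands for a respondent who skipped the question
answers = pd.Series(["Reddit", "Reddit", "YouTube", None])

# Fractions of the non-missing answers (NA is dropped by default)
shares = answers.value_counts(normalize=True)

# Fractions of all rows, with missing answers as their own category
shares_with_na = answers.value_counts(normalize=True, dropna=False)

# Convert fractions to percentages
percent = (shares * 100).round(1)

print(shares)
print(shares_with_na)
print(percent)
```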


### `groupby()` function

If we want to see which social media platform is used most in each country, we need to group our data by country.

A groupby operation involves some combination of splitting the object, applying a function, and combining the results.
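
To make the split, apply, and combine steps concrete, here is a minimal sketch on a hypothetical `toy` DataFrame (made-up salaries). It does the three steps by hand and then compares the result with a single `groupby` expression.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration
toy = pd.DataFrame({
    "Country": ["Turkey", "Ukraine", "Turkey", "Ukraine", "Turkey"],
    "ConvertedComp": [10000.0, 20000.0, 30000.0, 40000.0, 20000.0],
})

# Manual version of the three steps
manual = {}
for country in toy["Country"].unique():
    subset = toy[toy["Country"] == country]              # split
    manual[country] = subset["ConvertedComp"].median()   # apply
manual = pd.Series(manual)                               # combine

# groupby performs the same three steps in one expression
grouped = toy.groupby("Country")["ConvertedComp"].median()

print(manual)
print(grouped)
```

Both approaches should give the same medians; `groupby` is simply shorter and avoids the explicit loop.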

In [22]:
# groupby function returns a groupby object

country_grp = survey.groupby("Country")
country_grp

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000227F0E33148>

In [24]:
# get a specific group from the groupby object

country_grp.get_group("Turkey").head()

,Respondent,MainBranch,Hobbyist,OpenSourcer,OpenSource,Employment,Country,Student,EdLevel,UndergradMajor,...,WelcomeChange,SONewContent,Age,Gender,Trans,Sexuality,Ethnicity,Dependents,SurveyLength,SurveyEase
27,28,I am a developer by profession,Yes,Less than once per year,The quality of OSS and closed source software ...,Employed full-time,Turkey,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","A business discipline (ex. accounting, finance...",...,,,,Man,No,,,,,
122,123,I am a developer by profession,Yes,Less than once a month but more than once per ...,The quality of OSS and closed source software ...,Employed full-time,Turkey,No,"Bachelor’s degree (BA, BS, B.Eng., etc.)","Another engineering discipline (ex. civil, ele...",...,A lot more welcome now than last year,Tech articles written by other developers;Indu...,24.0,Man,No,Bisexual,,No,Too long,Easy
238,240,I am a student who is learning to code,Yes,Never,"OSS is, on average, of LOWER quality than prop...","Not employed, and not looking for work",Turkey,"Yes, full-time","Secondary school (e.g. American high school, G...",,...,A lot more welcome now than last year,Tech articles written by other developers;Indu...,21.0,Man,No,Straight / Heterosexual,,No,Too long,Neither easy nor difficult
632,635,I am a developer by profession,No,Less than once per year,"OSS is, on average, of HIGHER quality than pro...","Independent contractor, freelancer, or self-em...",Turkey,No,"Master’s degree (MA, MS, M.Eng., MBA, etc.)",Mathematics or statistics,...,Somewhat less welcome now than last year,,36.0,Woman,No,,White or of European descent,Yes,Appropriate in length,Difficult
670,673,I code primarily as a hobby,Yes,Less than once per year,"OSS is, on average, of HIGHER quality than pro...","Not employed, and not looking for work",Turkey,No,Primary/elementary school,,...,Somewhat more welcome now than last year,Tech articles written by other developers;Indu...,13.0,Man,No,Straight / Heterosexual,White or of European descent,No,Appropriate in length,Neither easy nor difficult


In [38]:
# for each country group we can count the unique values of a specific column
# this returns a Series with a two-level (Country, SocialMedia) index

country_grp["SocialMedia"].value_counts()

Country      SocialMedia             
Afghanistan  Facebook                    15
             YouTube                      9
             I don't use social media     6
             WhatsApp                     4
             Instagram                    1
                                         ..
Zimbabwe     Facebook                     3
             YouTube                      3
             Instagram                    2
             LinkedIn                     2
             Reddit                       1
Name: SocialMedia, Length: 1220, dtype: int64
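
A two-level result like this can be sliced by country, by a (country, platform) pair, or pivoted into a table. The sketch below uses a hypothetical `toy` DataFrame (made-up rows) and shows one possible way to answer the original question of which platform is used most in each country; `idxmax` returns the first column on ties.

```python
import pandas as pd

# Hypothetical toy data; rows are invented for illustration
toy = pd.DataFrame({
    "Country": ["Turkey", "Turkey", "Turkey", "Ukraine", "Ukraine", "Ukraine"],
    "SocialMedia": ["WhatsApp", "WhatsApp", "Twitter", "Reddit", "YouTube", "Reddit"],
})

counts = toy.groupby("Country")["SocialMedia"].value_counts()

# Select by the outer level, or by a (country, platform) pair
turkey_counts = counts.loc["Turkey"]
turkey_whatsapp = counts.loc[("Turkey", "WhatsApp")]

# Move the inner level into columns: one row per country, one column per platform
table = counts.unstack(fill_value=0)

# Column label with the largest count in each row
top_platform = table.idxmax(axis=1)

print(table)
print(top_platform)
```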

In [39]:
# we can count one column's unique values for a single group

country_grp.get_group("Turkey")["SocialMedia"].value_counts()

WhatsApp                    233
Twitter                     167
YouTube                     156
Instagram                   117
Reddit                       98
LinkedIn                     73
I don't use social media     41
Facebook                     29
Youku Tudou 优酷                1
Name: SocialMedia, dtype: int64

In [42]:
# Alternatively

country_grp["SocialMedia"].value_counts().loc["Turkey"]

SocialMedia
WhatsApp                    233
Twitter                     167
YouTube                     156
Instagram                   117
Reddit                       98
LinkedIn                     73
I don't use social media     41
Facebook                     29
Youku Tudou 优酷                1
Name: SocialMedia, dtype: int64

In [44]:
# median salary per country group, then select one country

country_grp["ConvertedComp"].median().loc["Turkey"]

17280.0

In [51]:
# median and mean salaries for each country group
# returns a DataFrame

country_grp["ConvertedComp"].agg(["median", "mean"])

Country,median,mean
Afghanistan,6222.0,101953.333333
Albania,10818.0,21833.700000
Algeria,7878.0,34924.047619
Andorra,160931.0,160931.000000
Angola,7764.0,7764.000000
...,...,...
"Venezuela, Bolivarian Republic of...",6384.0,14581.627907
Viet Nam,11892.0,17233.436782
Yemen,11940.0,16909.166667
Zambia,5040.0,10075.375000
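
In the output above, Afghanistan's mean (101953.33) is far larger than its median (6222.0), which suggests a few very large values pulling the mean up. The sketch below reproduces that effect on a hypothetical `toy` DataFrame (made-up salaries) and uses keyword-style named aggregation so the output columns get descriptive names; the column names `median_comp`, `mean_comp`, and `respondents` are chosen here for illustration.

```python
import pandas as pd

# Hypothetical toy data; one Turkey salary is an intentional outlier
toy = pd.DataFrame({
    "Country": ["Turkey", "Turkey", "Turkey", "Ukraine", "Ukraine"],
    "ConvertedComp": [10000.0, 20000.0, 600000.0, 30000.0, 50000.0],
})

# Named aggregation: keyword = output column name, value = aggregation function
summary = toy.groupby("Country")["ConvertedComp"].agg(
    median_comp="median",
    mean_comp="mean",
    respondents="count",
)

print(summary)
```

Including a count next to the averages helps spot groups, like Andorra and Angola above where mean and median coincide, that may rest on very few respondents.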


In [86]:
# count how many respondents in a specific country know Python

filt = survey["Country"] == "Turkey"
survey.loc[filt, "LanguageWorkedWith"].str.contains("Python").sum()

371

In [61]:
# do the same thing with the country group

country_grp.get_group("Turkey")["LanguageWorkedWith"].str.contains("Python").sum()

371

In [62]:
# count who does and does not know Python in a specific country

country_grp.get_group("Turkey")["LanguageWorkedWith"].str.contains("Python").value_counts()

False    553
True     371
Name: LanguageWorkedWith, dtype: int64

In [85]:
# What share of people from each country know Python? (returned as fractions, not percentages)

country_python_pct = country_grp["LanguageWorkedWith"].apply(lambda x: x.str.contains("Python", na=False).value_counts(normalize=True))
country_python_pct.rename({False: "Don't know", True: "I know"}, inplace=True)
country_python_pct

Country                
Afghanistan  Don't know    0.818182
             I know        0.181818
Albania      Don't know    0.732558
             I know        0.267442
Algeria      Don't know    0.701493
                             ...   
Yemen        I know        0.157895
Zambia       Don't know    0.666667
             I know        0.333333
Zimbabwe     Don't know    0.641026
             I know        0.358974
Name: LanguageWorkedWith, Length: 337, dtype: float64
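
If only the "knows Python" share is needed, one possible alternative is to build a boolean column first and take its mean per country, since the mean of a boolean Series is the fraction of `True` values. The sketch below runs on a hypothetical `toy` DataFrame (made-up rows) and, like the cell above, treats missing answers as "does not know" via `na=False`.

```python
import pandas as pd

# Hypothetical toy data; None stands for a skipped question
toy = pd.DataFrame({
    "Country": ["Turkey", "Turkey", "Turkey", "Ukraine", "Ukraine"],
    "LanguageWorkedWith": ["Python;C#", "Java", None, "Python", "Python;Go"],
})

# True where the answer mentions Python; missing answers count as False
knows_python = toy["LanguageWorkedWith"].str.contains("Python", na=False)

# Mean of booleans per country = fraction who know Python
python_share = knows_python.groupby(toy["Country"]).mean()

# Sum of booleans per country = number who know Python
python_count = knows_python.groupby(toy["Country"]).sum()

print(python_share)
print(python_count)
```

This yields one row per country instead of two, which can be easier to sort or plot.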

Source:
* [Corey Schafer - Python Pandas Tutorial](https://www.youtube.com/watch?v=ZyhVh-qRZPA&list=PL-osiE80TeTsWmV9i9c58mdDCSskIFdDS&index=1)