# Lab Assignment 10: Exploratory Data Analysis, Part 1
## DS 6001: Practice and Application of Data Science

### Instructions
Please answer the following questions as completely as possible using text, code, and the results of code as needed. Format your answers in a Jupyter notebook. To receive full credit, make sure you address every part of the problem, and make sure your document is formatted in a clean and professional way.

In this lab, you will be working with the 2018 [General Social Survey (GSS)](http://www.gss.norc.org/). The GSS is a sociological survey created and regularly collected since 1972 by the National Opinion Research Center at the University of Chicago. It is funded by the National Science Foundation. The GSS collects information and keeps a historical record of the concerns, experiences, attitudes, and practices of residents of the United States, and it is one of the most important data sources for the social sciences. 

The data includes features that measure concepts that are notoriously difficult to ask about directly, such as religion, racism, and sexism. The data also include many different metrics of how successful a person is in his or her profession, including income, socioeconomic status, and occupational prestige. These occupational prestige scores are coded separately by the GSS.  The full description of their methodology for measuring prestige is available here: http://gss.norc.org/Documents/reports/methodological-reports/MR122%20Occupational%20Prestige.pdf Here's a quote to give you an idea about how these scores are calculated:

> Respondents then were given small cards which each had a single occupational titles listed on it. Cards were in English or Spanish. They were given one card at a time in the preordained order. The interviewer then asked the respondent to "please put the card in the box at the top of the ladder if you think that occupation has the highest possible social standing. Put it in the box of the bottom of the ladder if you think it has the lowest possible social standing. If it belongs somewhere in between, just put it in the box that matches the social standing of the occupation."

The prestige scores are calculated from the aggregated rankings according to the method described above.

### Problem 0
Import the following packages:

In [1]:
import numpy as np
import pandas as pd
import sidetable
import weighted # this is a module of wquantiles, so type pip install wquantiles or conda install wquantiles to get access to it
from scipy import stats 
from sklearn import manifold
from sklearn import metrics
import prince
from pandas_profiling import ProfileReport
pd.options.display.max_columns = None


Then load the GSS data with the following code:

In [2]:
%%capture
gss = pd.read_csv("https://github.com/jkropko/DS-6001/raw/master/localdata/gss2018.csv",
                 encoding='cp1252', na_values=['IAP','IAP,DK,NA,uncodeable', 'NOT SURE',
                                               'DK', 'IAP, DK, NA, uncodeable', '.a', "CAN'T CHOOSE"])

### Problem 1
Drop all columns except for the following:
* `id` - a numeric unique ID for each person who responded to the survey
* `wtss` - survey sample weights
* `sex` - male or female
* `educ` - years of formal education
* `region` - region of the country where the respondent lives
* `age` - age
* `coninc` - the respondent's personal annual income
* `prestg10` - the respondent's occupational prestige score, as measured by the GSS using the methodology described above
* `mapres10` - the respondent's mother's occupational prestige score, as measured by the GSS using the methodology described above
* `papres10` - the respondent's father's occupational prestige score, as measured by the GSS using the methodology described above
* `sei10` - an index measuring the respondent's socioeconomic status
* `satjob` - responses to "On the whole, how satisfied are you with the work you do?"
* `fechld` - agree or disagree with: "A working mother can establish just as warm and secure a relationship with her children as a mother who does not work."
* `fefam` - agree or disagree with: "It is much better for everyone involved if the man is the achiever outside the home and the woman takes care of the home and family."
* `fepol` - agree or disagree with: "Most men are better suited emotionally for politics than are most women."
* `fepresch` - agree or disagree with: "A preschool child is likely to suffer if his or her mother works."
* `meovrwrk` - agree or disagree with: "Family life often suffers because men concentrate too much on their work."

Then rename any columns with names that are non-intuitive to you to more intuitive and descriptive ones. Finally, replace the "89 or older" values of `age` with 89, and convert `age` to a float data type. [1 point]

In [3]:
gss = gss.loc[:,["id", "wtss", "sex", 
              "educ", "region", "age",
              "coninc", "prestg10", "mapres10",
              "papres10", "sei10", "satjob",
              "fechld", "fefam", "fepol", 
              "fepresch", "meovrwrk"]]
gss.head(2)

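| | id | wtss | sex | educ | region | age | coninc | prestg10 | mapres10 | papres10 | sei10 | satjob | fechld | fefam | fepol | fepresch | meovrwrk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2.357493 | male | 14.0 | new england | 43 | NaN | 47.0 | 31.0 | 45.0 | 65.3 | very satisfied | strongly agree | disagree | agree | strongly disagree | agree |
| 1 | 2 | 0.942997 | female | 10.0 | new england | 74 | 22782.5 | 22.0 | 32.0 | 39.0 | 14.8 | NaN | NaN | NaN | NaN | NaN | NaN |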


In [4]:
gss.columns = ['id', 'samp_wt', 'sex', 'educ', 'region', 'age', 'annual_income', 'pres10',
       'mapres10', 'papres10', 'socecon', 'satjob', 'sahm_con', 'trad_household', 'trad_poli',
       'sahm_preschool', 'abs_father']
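
Assigning a full list to `gss.columns` works only as long as the list order matches the column order exactly. As a hedged alternative, the sketch below renames by mapping old names to new ones with `DataFrame.rename`. The toy frame and its values are made up for illustration, and only two of the renames from the cell above are shown.

```python
import pandas as pd

# Toy frame standing in for the GSS subset (values are illustrative only)
toy = pd.DataFrame({
    "wtss": [2.36, 0.94],
    "coninc": [50000.0, 22782.5],
    "sex": ["male", "female"],
})

# Map only the columns to be renamed; unlisted columns keep their names
new_names = {"wtss": "samp_wt", "coninc": "annual_income"}
toy = toy.rename(columns=new_names)

print(list(toy.columns))
```

Because the mapping is keyed by the old name, reordering or dropping columns earlier in the notebook cannot silently attach a new name to the wrong column.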

In [5]:
gss.age.unique()

array(['43', '74', '42', '63', '71', '67', '59', '62', '55', '34', '61',
       '44', '41', '75', '30', '40', '29', '37', '56', '82', '68', '20',
       '89 or older', '60', '65', '45', '50', '52', '46', '53', '22',
       '33', '23', '28', '27', '64', '79', '32', '35', '21', '47', '70',
       '77', '69', '48', '81', '78', '54', '58', '76', '39', '38', '25',
       '49', '18', '19', '26', '57', '51', '36', '72', '24', '88', '66',
       '84', '80', '31', '83', '73', '86', nan, '85', '87'], dtype=object)

In [6]:
gss.age = gss.age.replace('89 or older', '89').astype(float)
gss.age.unique()

array([43., 74., 42., 63., 71., 67., 59., 62., 55., 34., 61., 44., 41.,
       75., 30., 40., 29., 37., 56., 82., 68., 20., 89., 60., 65., 45.,
       50., 52., 46., 53., 22., 33., 23., 28., 27., 64., 79., 32., 35.,
       21., 47., 70., 77., 69., 48., 81., 78., 54., 58., 76., 39., 38.,
       25., 49., 18., 19., 26., 57., 51., 36., 72., 24., 88., 66., 84.,
       80., 31., 83., 73., 86., nan, 85., 87.])

### Problem 2
#### Part a
Use the `ProfileReport()` function to generate and embed an HTML formatted exploratory data analysis report in your notebook. Make sure that it includes a "Correlations" report along with "Overview" and "Variables". [1 point]

prof = ProfileReport(gss, 
                       title = 'General Social Survey EDA',
                       html = {'style': {'full_width': True}},
                       minimal = False)
prof.to_notebook_iframe()

#### Part b
Looking through the HTML report you displayed in part a, how many people in the data are from New England? [1 point]

#### Answer
124 people, or 5.3% of respondents, selected New England as their region.

#### Part c
Looking through the HTML report you displayed in part a, which feature in the data has the highest number of missing values, and what percent of the values are missing for this feature? [1 point]

#### Answer
The trad_poli feature has the highest number of missing values with 36.2% of the values missing.

#### Part d
Looking through the HTML report you displayed in part a, which two distinct features in the data have the highest correlation? [1 point]

#### Answer
The two features, socecon and pres10, have the highest correlation.

### Problem 3
On a primetime show on a 24-hour cable news network, two unpleasant-looking men in suits sit across a table from each other, scowling. One says "This economy is failing the middle-class. The average American today is making less than \$48,000 a year." The other screams "Fake news! The typical American makes more than \$55,000 a year!" Explain, using words and code, how the data can support both of their arguments. Use the sample weights to calculate descriptive statistics that are more representative of the American adult population as a whole. [1 point]

#### Answer

For the first man, "This economy is failing the middle-class. The average American today is making less than $48,000 a year.":

In [15]:
stats.trim_mean(gss["annual_income"].dropna(), 0.05)

46657.47656697627

A 5% trimmed mean drops the lowest 5% and the highest 5% of incomes before averaging, which limits the influence of extremely rich and extremely poor respondents. By this measure the 'average American' makes about 46,700 USD annually, which supports the first man's statement. Note that this figure is unweighted, so it describes the sample rather than the adult population as a whole.

For the second man, "Fake news! The typical American makes more than $55,000 a year!":

In [31]:
# Weighted mean of income, using the survey sample weights
income = gss.dropna(subset=['annual_income'])
np.average(income.annual_income, weights=income.samp_wt)

55158.96280421564

If we weight each respondent by the survey sample weight, the mean gives a more representative view of annual income, because the weights correct for groups that are over-represented or under-represented in the sample. The weighted mean shows that the 'average American' makes about 55,200 USD annually, which supports the second man's statement.
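
Another way to reconcile the two claims, while using the sample weights for both, is to compare a weighted mean with a weighted median: in a right-skewed income distribution the mean sits above the median. The sketch below is illustrative only. `weighted_median` is a hypothetical helper using one common convention (the smallest value at which the cumulative weight reaches half the total), the incomes and weights are made up, and `weighted.median` from wquantiles may interpolate and return a slightly different value.

```python
import numpy as np

def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half the total weight."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Drop observations with a missing value or a missing weight
    keep = ~(np.isnan(values) | np.isnan(weights))
    values, weights = values[keep], weights[keep]
    # Sort by value and accumulate the weights in that order
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    cum = np.cumsum(weights)
    idx = np.searchsorted(cum, 0.5 * cum[-1])
    return values[idx]

# Made-up, right-skewed incomes with made-up weights (not GSS values)
incomes = np.array([20000.0, 30000.0, 40000.0, 50000.0, 400000.0])
wts = np.array([2.0, 1.0, 1.0, 1.0, 1.0])

w_mean = np.average(incomes, weights=wts)
w_median = weighted_median(incomes, wts)
print(w_mean, w_median)
```

On the GSS data the same comparison would use `annual_income` and `samp_wt` after dropping rows with missing income.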

### Problem 4
For each of the following parts, 
* generate a table that provides evidence about the relationship between the two features in the data that are relevant to each question, 
* interpret the table in words, 
* use a hypothesis test to assess the strength of the evidence in the table, 
* and provide a **specific and accurate** interpretation of the $p$-value associated with this hypothesis test beyond "significant or not". 

#### Part a
Is there a gender wage gap? That is, is there a difference between the average incomes of men and women? [2 points]

In [70]:
## table
gss.groupby('sex').agg({'annual_income':'mean'})

| sex | annual_income |
|---|---|
| female | 47191.021452 |
| male | 53314.626187 |


#### Interpretation
The mean annual income is about 47,200 USD for females and about 53,300 USD for males, a difference of roughly 6,100 USD. Females in the sample earn less on average than males, which is consistent with a gender wage gap.

In [74]:
#hypo test
anovat = gss[["sex", "annual_income"]].dropna()
stats.ttest_ind(anovat.annual_income[anovat.sex=='female'], 
                anovat.annual_income[anovat.sex=='male'],
               equal_var=False)

TtestResult(statistic=-3.332824087618215, pvalue=0.0008749557881530089, df=2053.1579577339658)

#### Interpretation of p-val, 0.0008749557881530089
H0: Mean annual income is the same for men and women

HA: Mean annual income differs between men and women

Conclusion: The p-value is the probability of seeing a difference in sample means at least as large as the one observed (about 6,100 USD) if men and women truly had the same mean income. Here that probability is about 0.0009, or roughly 9 in 10,000 samples. Because such a result would be very unlikely under H0, we reject H0 and conclude that there is a statistically significant difference in mean income between men and women.
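
A permutation test makes the meaning of a p-value concrete: shuffle the group labels many times and count how often the shuffled difference in means is at least as large as the observed one. The sketch below is a swapped-in illustration, not the Welch t-test used above, and it runs on simulated incomes rather than the GSS data. The group sizes, means, and spread are assumptions chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated incomes for two groups (not GSS values)
group_a = rng.normal(47000, 30000, size=200)
group_b = rng.normal(53000, 30000, size=200)
observed = abs(group_a.mean() - group_b.mean())

# Under H0 the labels are exchangeable, so shuffle the pooled values
pooled = np.concatenate([group_a, group_b])
n_perm = 2000
count = 0
for _ in range(n_perm):
    shuffled = rng.permutation(pooled)
    diff = abs(shuffled[:200].mean() - shuffled[200:].mean())
    if diff >= observed:
        count += 1

# Share of shuffles with a gap at least as large as the observed gap
p_value = count / n_perm
print(observed, p_value)
```

The resulting share plays the same role as the p-value reported by `stats.ttest_ind`, although the two tests rest on different assumptions and will not give identical numbers.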

#### Part b
Are there different average values of occupational prestige for different levels of job satisfaction? [2 points]

In [78]:
gss.groupby('satjob').agg({'pres10':'mean'})

| satjob | pres10 |
|---|---|
| a little dissat | 40.946429 |
| mod. satisfied | 42.589984 |
| very dissatisfied | 43.0 |
| very satisfied | 46.18932 |


#### Interpretation
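Mean occupational prestige is highest among respondents who are very satisfied with their jobs (46.2) and lowest among those who are a little dissatisfied (40.9), a spread of about 5 points. The pattern is not strictly increasing with satisfaction: the very dissatisfied average 43.0, slightly above the moderately satisfied (42.6). The table suggests that mean prestige differs across satisfaction levels.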

In [79]:
# hypo test
stats.f_oneway(gss.query("satjob=='a little dissat'").pres10.dropna(),
               gss.query("satjob=='mod. satisfied'").pres10.dropna(),
               gss.query("satjob=='very dissatisfied'").pres10.dropna(),
              gss.query("satjob=='very satisfied'").pres10.dropna())

F_onewayResult(statistic=12.205403153509732, pvalue=6.676686425029878e-08)

#### Interpretation of p-val, 6.676686425029878e-08
H0: Mean occupational prestige is the same at every level of job satisfaction

HA: Mean occupational prestige differs for at least one level of job satisfaction

Conclusion: If mean prestige were truly the same at all four satisfaction levels, an F statistic at least as large as the observed 12.2 would occur with probability of about 7 in 100 million. Because such a result would be extremely unlikely under H0, we reject H0 and conclude that mean prestige differs across at least some job satisfaction levels. The test does not say which levels differ.

### Problem 5
Report the Pearson correlations among years of education, socioeconomic status, income, occupational prestige, and a person's mother's and father's occupational prestige. Then perform a hypothesis test for the correlation between years of education and socioeconomic status and provide a **specific and accurate** interpretation of the $p$-value associated with this hypothesis test beyond "significant or not". [2 points]

In [81]:
# correlation
gss[["educ","socecon","annual_income","pres10","mapres10","papres10"]].corr()

| | educ | socecon | annual_income | pres10 | mapres10 | papres10 |
|---|---|---|---|---|---|---|
| educ | 1.0 | 0.558169 | 0.389245 | 0.479933 | 0.269115 | 0.261417 |
| socecon | 0.558169 | 1.0 | 0.41721 | 0.835515 | 0.203486 | 0.210451 |
| annual_income | 0.389245 | 0.41721 | 1.0 | 0.340995 | 0.164881 | 0.171048 |
| pres10 | 0.479933 | 0.835515 | 0.340995 | 1.0 | 0.189262 | 0.19218 |
| mapres10 | 0.269115 | 0.203486 | 0.164881 | 0.189262 | 1.0 | 0.23575 |
| papres10 | 0.261417 | 0.210451 | 0.171048 | 0.19218 | 0.23575 | 1.0 |


In [85]:
# hypo test
gss_corr = gss[["educ","socecon"]].dropna()
stats.pearsonr(gss_corr['educ'], gss_corr['socecon'])

PearsonRResult(statistic=0.558168600462678, pvalue=3.7194488100312476e-184)

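#### Interpretation of p-val, 3.7194488100312476e-184
H0: The correlation between years of education and socioeconomic status is 0

HA: The correlation between years of education and socioeconomic status is not 0

Conclusion: If the true correlation between years of education and socioeconomic status were 0, the probability of drawing a sample with a correlation at least as far from 0 as the observed 0.558 would be about 3.7e-184, which is effectively zero. We therefore reject H0 and conclude that the correlation is nonzero; the positive sample correlation indicates that more years of education go with higher socioeconomic status.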
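
For reference, the test of a zero Pearson correlation is equivalent to a $t$-test on the sample correlation $r$ computed from $n$ complete pairs. The statement below is the standard textbook form, given under the assumption that the two features are approximately bivariate normal; $n$ is left symbolic because the number of complete pairs is not printed above.

$$
t = r\sqrt{\frac{n-2}{1-r^{2}}}, \qquad t \sim t_{n-2} \ \text{under } H_0:\ \rho = 0
$$

With $r = 0.558$ and a sample in the thousands, $t$ is very large, which is why the $p$-value is so close to zero.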

### Problem 6
Create a new categorical feature for age groups, with categories for 18-35, 36-49, 50-69, and 70 and older (see the module 8 notebook for an example of how to do this). 

Then create a cross-tabulation in which the rows represent age groups and the columns represent responses to the statement that "It is much better for everyone involved if the man is the achiever outside the home and the woman takes care of the home and family." Rearrange the columns so that they are in the following order: strongly agree, agree, disagree, strongly disagree. Place row percents in the cells of this table.

Finally, use a hypothesis test that can tell us whether there is enough evidence to conclude that these two features have a relationship, and provide a specific and accurate interpretation of the $p$-value. [2 points]

In [86]:
gss.age.sort_values().unique()

array([18., 19., 20., 21., 22., 23., 24., 25., 26., 27., 28., 29., 30.,
       31., 32., 33., 34., 35., 36., 37., 38., 39., 40., 41., 42., 43.,
       44., 45., 46., 47., 48., 49., 50., 51., 52., 53., 54., 55., 56.,
       57., 58., 59., 60., 61., 62., 63., 64., 65., 66., 67., 68., 69.,
       70., 71., 72., 73., 74., 75., 76., 77., 78., 79., 80., 81., 82.,
       83., 84., 85., 86., 87., 88., 89., nan])

In [87]:
# Bin ages into four groups; bins are right-inclusive, so (17, 35] covers ages 18-35.
# Missing ages stay missing.
gss["age_cate"] = pd.cut(gss.age, bins=[17, 35, 49, 69, 89],
                         labels=["18-35", "36-49", "50-69", "70_over"])

In [88]:
gss.age_cate.value_counts()

age_cate
50-69      771
18-35      672
36-49      541
70_over    357
Name: count, dtype: int64

In [90]:
gss["trad_household"] = gss.trad_household.astype('category').cat.reorder_categories(["strongly agree",
                                                                                     "agree",
                                                                                     "disagree",
                                                                                     "strongly disagree"])

In [92]:
(pd.crosstab(gss.age_cate, gss.trad_household, normalize='index')*100).round(2)

| age_cate | strongly agree | agree | disagree | strongly disagree |
|---|---|---|---|---|
| 18-35 | 3.94 | 14.04 | 47.54 | 34.48 |
| 36-49 | 4.79 | 17.46 | 46.48 | 31.27 |
| 50-69 | 4.63 | 20.85 | 48.07 | 26.45 |
| 70_over | 11.97 | 31.66 | 39.0 | 17.37 |


In [93]:
# hypo test
stats.chi2_contingency(pd.crosstab(gss.age_cate, gss.trad_household))

Chi2ContingencyResult(statistic=69.24381761791811, pvalue=2.1419004733989943e-11, dof=9, expected_freq=array([[ 23.23016905,  81.56957087, 186.89726918, 114.3029909 ],
       [ 20.31209363,  71.32314694, 163.42002601,  99.94473342],
       [ 29.63849155, 104.07152146, 238.45513654, 145.83485046],
       [ 14.81924577,  52.03576073, 119.22756827,  72.91742523]]))

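#### Interpretation of p-val, 2.1419004733989943e-11
H0: Agreement with the traditional-household statement is independent of age group

HA: Agreement with the traditional-household statement depends on age group

Conclusion: If age group and agreement with the statement were truly independent, a chi-square statistic at least as large as the observed 69.2 (with 9 degrees of freedom) would occur with probability of about 2 in 100 billion. Because such a result would be extremely unlikely under H0, we reject H0 and conclude that there is a relationship between age and views on traditional households. The row percents show the direction: about 18% of respondents aged 18-35 agree or strongly agree with the statement, compared with about 44% of those aged 70 and over.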
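
The `expected_freq` array in the output above holds the counts that would be expected if the two features were independent. A minimal sketch of that calculation follows, using a made-up 2x3 table rather than the GSS crosstab: each expected count is the row total times the column total divided by the grand total, and the chi-square statistic sums the squared gaps between observed and expected counts, scaled by the expected counts.

```python
import numpy as np

# Made-up 2x3 table of counts (rows: groups, columns: responses)
observed = np.array([[10, 20, 30],
                     [20, 20, 20]])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 3)
n = observed.sum()

# Expected counts under independence
expected = row_totals * col_totals / n

# Pearson chi-square statistic and its degrees of freedom
chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
print(expected, chi2, dof)
```

The same degrees-of-freedom formula gives (4 - 1) x (4 - 1) = 9 for the 4x4 table of age groups and responses, matching the `dof=9` reported above.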

### Problem 7
For this problem, you will conduct and interpret a correspondence analysis on the categorical features that ask respondents to state the extent to which they agree or disagree with the statements:
* "A working mother can establish just as warm and secure a relationship with her children as a mother who does not work."
* "It is much better for everyone involved if the man is the achiever outside the home and the woman takes care of the home and family."
* "Most men are better suited emotionally for politics than are most women."
* "A preschool child is likely to suffer if his or her mother works."
* "Family life often suffers because men concentrate too much on their work."

#### Part a
Conduct a correspondence analysis using the observed features listed above that measures two latent features. Plot the two latent features for each category in each of the features used in the analysis. [2 points]

#### Part b
Display the latent features for every category in the observed features, sorted by the first latent feature. Describe in words what concept this feature is attempting to measure, and give the feature a name. [2 points]

#### Part c
We can use the results of the MCA model to conduct some cool EDA. For one example, follow these steps:

1. Use the `.row_coordinates()` method to calculate values of the latent feature for every row in the data you passed to the MCA in part a. Extract the first column and store it in its own dataframe.

2. To join it with the full, cleaned GSS data based on row numbers (instead of on a primary key), use the `.join()` method. For example, if we named the cleaned GSS data `gss_clean` and if we named the dataframe in step 1 `latentfeature`, we can type
```
gss_clean = gss_clean.join(latentfeature, how="outer")
```
3. Create a cross-tabulation with age categories (that you constructed in problem 6) in the rows and sex in the columns. Instead of a frequency, place the mean value of the latent feature in the cells. 

What does this table tell you about the relationship between sex, age, and the latent feature? [2 points]
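
Step 3 can be sketched with the `values` and `aggfunc` arguments of `pd.crosstab`, which replace cell frequencies with an aggregate of another column. The example below is a sketch on made-up rows: the column name `latent` is a hypothetical stand-in for the first MCA row coordinate, and the numbers are not results from the GSS data.

```python
import pandas as pd

# Made-up rows; `latent` stands in for the first MCA row coordinate
toy = pd.DataFrame({
    "age_cate": ["18-35", "18-35", "18-35", "70_over", "70_over", "70_over"],
    "sex": ["female", "female", "male", "female", "male", "male"],
    "latent": [-0.5, -0.3, -0.1, 0.3, 0.6, 0.8],
})

# Rows: age group, columns: sex, cells: mean of the latent feature
table = pd.crosstab(toy.age_cate, toy.sex, values=toy.latent, aggfunc="mean")
print(table)
```

Reading across a row compares men and women within an age group, and reading down a column shows how the latent feature changes with age for one sex.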