# Project Overview

In a short paragraph, describe the problem you want to explore using data science techniques.

# Problem Description
In one sentence, describe the problem you want to explore with data science.

# Subject Matter Expertise

In bulleted format, describe the subject matters that will help you explore your topic. Example:
 
1. Data Analysis
2. Data Visualization
3. Statistics and Probability
4. Hypothesis Testing
5. Linear Regression

# Assumptions
List any assumptions you may have about the topic.
`An assumption is a thing that is accepted as true or as certain to happen, without proof.`


# Steps to Explore the Topic and Problem

List the steps you're going to take to explore the topic using the data sources you identify and the techniques you already know. Please list them as step 1, step 2, step 3. Example:

1. Download data from U.S. Census
2. Web-scrape data from IMDB
3. Find the number of actors that do not live in California
4. Calculate the median salary for actors that do not live in California
5. Show the relationship between actors living in CA vs outside CA with pie chart
6. Show relationship between actors' salary living in CA vs outside CA with bar chart
7. Calculate median home price of homes in CA using web scrape data
8. Compare this to median salary for actors using bar chart
9. Draw some initial conclusions from actor salary, location, and median home price about whether actors can afford to live in CA.

# Data Sources:

In bulleted format, list where you will get data from. Data sources must include one existing data source and one web-scraped source. Example:

1. U.S. Census Estimates of the Total Resident Population and Resident Population Age 18 Years and Older for the United States, States, and Puerto Rico `https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/state/detail/SCPRC-EST2019-18+POP-RES.csv`
2. Kaggle Elon Musk Tweets `https://www.kaggle.com/kingburrito666/elon-musk-tweets`

# Data Exploration
Describe the data using what you know. For larger datasets, you may have to pull out the columns that are of interest to you.

In [3]:
# imports
import pandas as pd
import numpy as np
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

from datascience.util import sample_proportions

In [26]:
world_population = pd.read_csv("WorldPopulationByAge2020.csv")
world_population.head(20)

Unnamed: 0,Location,AgeGrp,PopMale,PopFemale,PopTotal
0,Afghanistan,0-19,10709.0,10197.0,20906.0
1,Afghanistan,20-39,5994.0,5574.0,11568.0
2,Afghanistan,40-59,2485.0,2316.0,4801.0
3,Afghanistan,60+,781.0,858.0,1639.0
4,Africa,0-19,344109.0,334982.0,679091.0
5,Africa,20-39,197448.0,197144.0,394592.0
6,Africa,40-59,94547.0,98460.0,193007.0
7,Africa,60+,33767.0,40123.0,73890.0
8,African Group,0-19,343795.0,334677.0,678472.0
9,African Group,20-39,197193.0,196889.0,394082.0


# Data Cleaning
Show the techniques you use to reduce the impact of outliers and to drop missing or null values (if any).

In [25]:
print("Location Nan values count:", world_population["Location"].isna().sum())
print("Age Group Nan values count:", world_population["AgeGrp"].isna().sum())
print("Male Population Nan values count:", world_population["PopMale"].isna().sum())
print("Female Population Nan values count:", world_population["PopFemale"].isna().sum())
print("Total Population Nan values count:", world_population["PopTotal"].isna().sum())

Location Nan values count: 0
Age Group Nan values count: 0
Male Population Nan values count: 0
Female Population Nan values count: 0
Total Population Nan values count: 0


As we can see above, the dataframe has no NaN values, so there is nothing to drop. The dataframe also contains "Africa", which is not a country. The column header is Location rather than Country, so the dataframe holds data for countries, continents, and other kinds of geographical areas. It is not necessary to filter those out, because we will select only the rows we need and ignore the rest.

# Describe the Data Using Descriptive Stats
Use descriptive stats to tell us about your data. Must include mean, median, and mode where applicable. Also must talk about normality of data.

Remember what `standard deviation`, `mean`, `central tendency`, and `variance` mean for your data

In [32]:
def print_descriptive_stats(column):
    print("For column ", column)
    print("Mean", world_population[column].mean())
    print("Median", world_population[column].median())
    print("Mode", world_population[column].mode())
    print("Standard Deviation", world_population[column].std())
    print("Variance", world_population[column].var())
    print()

print_descriptive_stats("PopMale")
print_descriptive_stats("PopFemale")
print_descriptive_stats("PopTotal")

For column  PopMale
Mean 88324.34431818181
Median 8066.0
Mode 0    12.0
1    16.0
2    23.0
3    36.0
dtype: float64
Standard Deviation 195822.10023695984
Variance 38346294941.21395

For column  PopFemale
Mean 86534.43863636363
Median 8187.5
Mode 0    12.0
dtype: float64
Standard Deviation 187027.9601316678
Variance 34979457871.01272

For column  PopTotal
Mean 174858.78295454546
Median 16211.0
Mode 0    32.0
dtype: float64
Standard Deviation 382681.6110750213
Variance 146445215454.97388



The descriptive statistics above are not meaningful for our purpose because of how the dataframe is structured: the Location column mixes countries, continents, and other groupings, so regional aggregates sit alongside individual countries. Rather than spend time filtering out the locations we do not need, we simply select the rows we do need.
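
One thing the printed numbers do show is how far each column is from normal: the mean is roughly ten times the median, which points to a strong right skew. Below is a minimal sketch of that check using Pearson's second skewness coefficient, 3 × (mean - median) / standard deviation. The PopMale values are copied from the output above, and the helper name `pearson_median_skewness` is made up for illustration.

```python
def pearson_median_skewness(mean, median, stdev):
    """Pearson's second skewness coefficient: 3 * (mean - median) / stdev."""
    return 3 * (mean - median) / stdev


# PopMale summary values copied from the printed output above.
pop_male_mean = 88324.34431818181
pop_male_median = 8066.0
pop_male_stdev = 195822.10023695984

pop_male_skew = pearson_median_skewness(pop_male_mean, pop_male_median, pop_male_stdev)
print("PopMale skewness (Pearson median):", round(pop_male_skew, 2))
```

A value near 0 would be consistent with a symmetric distribution. A clearly positive value like this one indicates a long right tail, so the column should not be treated as normally distributed.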

We are concerned only with the data for Nepal, so let's find the male, female, and total population for each of Nepal's four age groups.





In [45]:
import statistics

# Get actual values of male and female population
def get_male_female_pop(location, age_group):
    male_pop = world_population.loc[(world_population["Location"] == location) & (world_population["AgeGrp"] == age_group), "PopMale"]
    female_pop = world_population.loc[(world_population["Location"] == location) & (world_population["AgeGrp"] == age_group), "PopFemale"]
    return (male_pop, female_pop)

nepal_age_0_19_male, nepal_age_0_19_female = get_male_female_pop("Nepal", "0-19")
nepal_age_20_39_male, nepal_age_20_39_female = get_male_female_pop("Nepal", "20-39")
nepal_age_40_59_male, nepal_age_40_59_female = get_male_female_pop("Nepal", "40-59")
nepal_age_60_plus_male, nepal_age_60_plus_female = get_male_female_pop("Nepal", "60+")

nepal_male_pop = [nepal_age_0_19_male.iloc[0], nepal_age_20_39_male.iloc[0], nepal_age_40_59_male.iloc[0], nepal_age_60_plus_male.iloc[0]]
nepal_female_pop = [nepal_age_0_19_female.iloc[0], nepal_age_20_39_female.iloc[0], nepal_age_40_59_female.iloc[0], nepal_age_60_plus_female.iloc[0]]

print(nepal_male_pop)
print(nepal_female_pop)

nepal_total_male_pop = sum(nepal_male_pop)
nepal_total_female_pop = sum(nepal_female_pop)
nepal_total_population = nepal_total_male_pop + nepal_total_female_pop
nepal_male_portion = nepal_total_male_pop / nepal_total_population
nepal_female_portion = nepal_total_female_pop / nepal_total_population
print("Male:Female ratio in total Nepal population: " + str(nepal_male_portion) + ":" + str(nepal_female_portion) + "\n")

def print_descriptive_stats_nepal(nepal_male_pop, nepal_female_pop):
    print("For Nepal male population:")
    print("Mean: ", statistics.mean(nepal_male_pop))
    print("Median: ", statistics.median(nepal_male_pop))
    print("Standard Deviation: ", statistics.stdev(nepal_male_pop))
    print("Variance: ", statistics.variance(nepal_male_pop))
    print()

    print("For Nepal female population:")
    print("Mean: ", statistics.mean(nepal_female_pop))
    print("Median: ", statistics.median(nepal_female_pop))
    print("Standard Deviation: ", statistics.stdev(nepal_female_pop))
    print("Variance: ", statistics.variance(nepal_female_pop))

print_descriptive_stats_nepal(nepal_male_pop, nepal_female_pop)

[5848.0, 4050.0, 2291.0, 1149.0]
[5732.0, 5626.0, 3058.0, 1365.0]
Male:Female ratio in total Nepal population: 0.45805144407431575:0.5419485559256842

For Nepal male population:
Mean:  3334.5
Median:  3170.5
Standard Deviation:  2057.095444228747
Variance:  4231641.666666667

For Nepal female population:
Mean:  3945.25
Median:  4342.0
Standard Deviation:  2118.356182672467
Variance:  4487432.916666667
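
Note that `statistics.stdev` and `statistics.variance` compute the sample versions, which divide by n - 1. If the four age groups are treated as the complete set rather than a sample drawn from more groups, the population versions `statistics.pstdev` and `statistics.pvariance` (which divide by n) may be the more fitting choice. Which one applies is a judgment call, not something the dataset settles. A short sketch of the difference, with the male values copied from the output above:

```python
import statistics

# Nepal male population by age group (0-19, 20-39, 40-59, 60+), copied from the output above.
nepal_male_pop = [5848.0, 4050.0, 2291.0, 1149.0]

sample_variance = statistics.variance(nepal_male_pop)       # divides by n - 1
population_variance = statistics.pvariance(nepal_male_pop)  # divides by n

print("Sample variance:    ", sample_variance)
print("Population variance:", population_variance)
print("Population std dev: ", statistics.pstdev(nepal_male_pop))
```

With only four values the two versions differ noticeably, so it is worth stating which one is reported.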


# Data Visualization Continued
Use everything from Mini-project 1 to describe the plots you will use to visualize your data.

Must use histogram, bar charts, and pie charts to describe aspects about the data

Determine if your data is normal or not!



In [None]:
def visualize_histogram():
    pass  # placeholder: the plotting code has not been written yet
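
The following is a rough sketch of two of the required plots, a bar chart and a pie chart, for the Nepal figures. It is not a finished cell: the helper names `age_group_shares` and `plot_nepal_age_groups` are hypothetical, and the values are copied from the Nepal output earlier in the notebook.

```python
AGE_GROUPS = ["0-19", "20-39", "40-59", "60+"]
NEPAL_MALE_POP = [5848.0, 4050.0, 2291.0, 1149.0]
NEPAL_FEMALE_POP = [5732.0, 5626.0, 3058.0, 1365.0]


def age_group_shares(counts):
    """Return each age group's share of the total, as fractions summing to 1."""
    total = sum(counts)
    return [count / total for count in counts]


def plot_nepal_age_groups():
    """Draw a grouped bar chart by sex and a pie chart of the combined population."""
    # Imported here so age_group_shares can be used without matplotlib installed.
    import matplotlib.pyplot as plt

    fig, (ax_bar, ax_pie) = plt.subplots(1, 2, figsize=(10, 4))

    # Grouped bars: male and female counts side by side for each age group.
    positions = list(range(len(AGE_GROUPS)))
    width = 0.4
    ax_bar.bar([p - width / 2 for p in positions], NEPAL_MALE_POP, width, label="Male")
    ax_bar.bar([p + width / 2 for p in positions], NEPAL_FEMALE_POP, width, label="Female")
    ax_bar.set_xticks(positions)
    ax_bar.set_xticklabels(AGE_GROUPS)
    ax_bar.set_xlabel("Age group")
    ax_bar.set_ylabel("Population (units as in the dataset)")
    ax_bar.legend()

    # Pie chart: share of each age group in the combined male + female population.
    combined = [m + f for m, f in zip(NEPAL_MALE_POP, NEPAL_FEMALE_POP)]
    ax_pie.pie(age_group_shares(combined), labels=AGE_GROUPS, autopct="%1.1f%%")
    ax_pie.set_title("Nepal population by age group")
    return fig
```

Calling `plot_nepal_age_groups()` in a notebook cell renders both charts. A histogram is better suited to a column with many values, such as PopTotal across all locations, than to four age-group counts.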

## Data Sampling

Draw some random samples from your dataset(s)
(If you joined your dataset into one large dataset, only sample randomly from this large dataset)

You must describe how you chose to pick the random samples, i.e. systematic or probabilistic sampling

You must describe if you draw random samples with or without replacement

Describe why or why not you chose to randomly draw samples with replacement or without replacement
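
A minimal sketch of the two sampling modes using Python's standard `random` module. The short list of locations is a stand-in for the real dataset, and the seed is fixed only so the draws can be reproduced.

```python
import random

# Stand-in population: a few Location values that appear in the dataset.
locations = ["Afghanistan", "Africa", "African Group", "Nepal"]

rng = random.Random(42)  # fixed seed so the draws are reproducible

# Without replacement: each location can be picked at most once.
without_replacement = rng.sample(locations, k=3)

# With replacement: the same location may be picked more than once.
with_replacement = rng.choices(locations, k=3)

print("Without replacement:", without_replacement)
print("With replacement:   ", with_replacement)
```

Sampling without replacement suits cases where each row should appear at most once, while sampling with replacement is the usual choice for bootstrap-style resampling.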

## Find Probability
Find the probability of two events that must both happen in your data analysis

Find the probability of an event that doesn't happen using your dataset(s)

Find the probability of events that are equally likely to occur. You might have to think about this in relation to the problem you are exploring
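
As a hedged illustration of the first two tasks, the sketch below treats "pick one person at random from Nepal's population" as the experiment and uses the counts printed earlier. Choosing that experiment is an assumption made for this example.

```python
from fractions import Fraction

age_groups = ["0-19", "20-39", "40-59", "60+"]
male = dict(zip(age_groups, [5848, 4050, 2291, 1149]))
female = dict(zip(age_groups, [5732, 5626, 3058, 1365]))
total = sum(male.values()) + sum(female.values())

# Two events that must both happen: the person is female AND aged 60+.
p_female_and_60_plus = Fraction(female["60+"], total)

# An event that does not happen: the person is NOT in the 0-19 group (complement rule).
p_age_0_19 = Fraction(male["0-19"] + female["0-19"], total)
p_not_age_0_19 = 1 - p_age_0_19

print("P(female and 60+):", float(p_female_and_60_plus))
print("P(not aged 0-19): ", float(p_not_age_0_19))
```

Using `Fraction` keeps the probabilities exact until they are printed.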



## Testing Hypotheses

Choose two hypotheses you want to explore with regard to your topic. Examples:

1. Does the state of Montana like cheesecake as a dessert?
2. Has the population of Washington, DC decreased over the past decade?
3. Do United States streaming services mainly stream rap or hip-hop?

### Note: Make sure to state the null and the alternative hypotheses for each of the questions you want to test

## Test Statistics 
For each of your questions list the test statistic that you are going to use to test the hypothesis.

## Observed Values
Show the observed value of the test statistics
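
A minimal simulation sketch for one possible question: is the male share of Nepal's population consistent with an even 50/50 split? Null hypothesis: the male share is 0.5. Alternative hypothesis: it is not. The test statistic is the absolute distance between the sample male share and 0.5. The sample size of 1,000 people is an assumption for illustration only (the dataset gives population counts, not a sample), and the function name `simulate_statistic` is hypothetical.

```python
import random

OBSERVED_MALE_SHARE = 0.45805144407431575  # male portion printed earlier for Nepal
NULL_SHARE = 0.5
SAMPLE_SIZE = 1000        # assumed sample size, not taken from the dataset
NUM_SIMULATIONS = 2000


def simulate_statistic(rng):
    """Simulate one sample under the null and return |male share - 0.5|."""
    males = sum(rng.random() < NULL_SHARE for _ in range(SAMPLE_SIZE))
    return abs(males / SAMPLE_SIZE - NULL_SHARE)


rng = random.Random(0)  # fixed seed for reproducibility
observed_statistic = abs(OBSERVED_MALE_SHARE - NULL_SHARE)
simulated = [simulate_statistic(rng) for _ in range(NUM_SIMULATIONS)]

# Empirical p-value: fraction of simulated statistics at least as large as the observed one.
p_value = sum(s >= observed_statistic for s in simulated) / NUM_SIMULATIONS

print("Observed test statistic:", observed_statistic)
print("Empirical p-value:", p_value)
```

A small p-value would count as evidence against the null hypothesis. The result depends on the assumed sample size, so it should be read as an illustration of the procedure rather than a finding.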

# Hypothesis Tests Conclusions
Communicate what you found about your topic to the audience.

# Topic Conclusions
Sum up your conclusions: what you did, and why it is interesting to you, to the public reading your analysis, or to a particular population.