# Jupyter Notebook for Final Assignment DiSC Fall 2018

***Complete all of the prompts below. They will serve both as demonstrations of your technical/analytic skills and as source material for the substantive discussions in the remainder of your project. Download this Jupyter notebook from the finalProject folder of the GitHub repository, rename it, then follow the directions for each prompt completely.***

*All data are drawn from the [Million Song Subset](https://labrosa.ee.columbia.edu/millionsong/).*

*Please use a p<0.05 significance level for all hypothesis testing.*

## Prompt 1: Genre and Decade

***Some musical genres seem to be more enduring than others. Rock, Adult Contemporary, Country and Gospel/Christian are a few examples of genre labels that continue to be salient over many years, even if the style itself changes throughout that time. Choose 3 common genres and test whether songs (in the subset) were released with the same relative frequency across the 1980s, 1990s and 2000s in each of the three genres.***

In [None]:
import pandas as pd

In [None]:
#Download data for songs in genres with 50 or more songs in the subset
indata1 = pd.read_csv('https://raw.githubusercontent.com/ndporter/pythonDiSC/master/finalProject/DiSC_Million_Songs_Popular_Tags.csv')

In [None]:
###Just run this cell and don't worry about what it means unless you're interested###

#create decades variable
#function to run on each row to create a (string) decade variable
def makedecade(row):
    if 1980<=row['year']<1990:
        val='80s'
    elif 1990<=row['year']<2000:
        val='90s'
    elif 2000<=row['year']<2010:
        val='00s'
    else:
        val='missing'
    return val
#apply does the same thing to each column (axis=0) or row (axis=1) in the dataframe
#so below, I apply the makedecade function to each row to create a new column with decades
indata1['decade'] = indata1.apply(makedecade,axis=1) 
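
As an optional aside, the same decade recode can probably be done without a row-wise function by using `pd.cut`. The sketch below runs on a small made-up frame (`toy` is hypothetical, not the course data) and assumes the same cut points as `makedecade`.

```python
import pandas as pd

# Hypothetical toy data standing in for indata1 (not the real subset)
toy = pd.DataFrame({'year': [1979, 1980, 1989, 1990, 1999, 2000, 2009, 2010, 0]})

# right=False makes each bin left-inclusive, e.g. 1980 <= year < 1990
decade = pd.cut(
    toy['year'],
    bins=[1980, 1990, 2000, 2010],
    labels=['80s', '90s', '00s'],
    right=False,
)

# Years outside every bin come back as NaN; label them 'missing' like makedecade does
toy['decade'] = decade.cat.add_categories(['missing']).fillna('missing').astype(str)
print(toy)
```

Either approach should give the same labels; `apply` with a function is easier to read, while `pd.cut` tends to be faster on large frames.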

In [None]:
indata1.head()

In [None]:
#remove decades outside the period of interest (and data missing on year)
indata1 = indata1[indata1['decade']!='missing']

In [None]:
indata1.head()

In [None]:
#View a list of genres to choose from
indata1['artist_mbtags'].groupby(indata1['artist_mbtags']).count()

*Choose three genres to investigate and create a hypothesis about them related to the prompt above. Include a theoretical rationale justifying your hypothesis in terms of the data source and/or the genres themselves and write both the hypothesis and justification (in words) in the cell below.*

**Prompt 1 Hypothesis:**

type here

*Formally state the null (H0) hypotheses in the cell below.*

**Prompt 1 Null Hypotheses:**

type here

In [None]:
#Restrict to key genres by ***replacing sample with your genres of interest***
genres = ['folk','rock','soul and reggae']

def genreIndicator(row):
    temp = 0
    for genre in genres:
        if row['artist_mbtags']==genre:
            temp += 1
    return temp

indata1['keep']=indata1.apply(genreIndicator,axis=1)

indata1 = indata1[indata1['keep'] != 0]
indata1.head()
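
For reference, the same row filter can likely be written in one line with `Series.isin`. This is only a sketch on made-up rows (`toy` and its values are hypothetical); the `genres` list reuses the sample values from the cell above.

```python
import pandas as pd

# Hypothetical toy rows; only the column name matches the course data
toy = pd.DataFrame({
    'artist_mbtags': ['folk', 'rock', 'jazz', 'soul and reggae', 'pop'],
    'decade': ['80s', '90s', '90s', '00s', '00s'],
})

genres = ['folk', 'rock', 'soul and reggae']

# Keep only rows whose tag is one of the chosen genres
kept = toy[toy['artist_mbtags'].isin(genres)]
print(kept)
```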

*Test your hypotheses using the appropriate statistical test. You do not need to worry about missing data here.*
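
If you decide that a chi-square test of independence fits your hypothesis, the sketch below shows one way to run it with `pd.crosstab` and `scipy.stats.chi2_contingency`. The counts are invented for illustration (`toy` is hypothetical), so the numbers it prints say nothing about the real data.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up song counts per (genre, decade); NOT drawn from the Million Song Subset
counts = {
    ('folk', '80s'): 20, ('folk', '90s'): 5, ('folk', '00s'): 5,
    ('rock', '80s'): 10, ('rock', '90s'): 10, ('rock', '00s'): 10,
    ('soul and reggae', '80s'): 5, ('soul and reggae', '90s'): 5, ('soul and reggae', '00s'): 20,
}

# Expand the counts into one row per song, like the course data
rows = [{'artist_mbtags': g, 'decade': d} for (g, d), n in counts.items() for _ in range(n)]
toy = pd.DataFrame(rows)

# Contingency table: genres in rows, decades in columns
table = pd.crosstab(toy['artist_mbtags'], toy['decade'])

# chi2_contingency returns the statistic, p-value, degrees of freedom, and expected counts
chi2, p, dof, expected = chi2_contingency(table)
print(table)
print('chi2 =', chi2, 'p =', p, 'dof =', dof)
```

Compare `p` to the 0.05 threshold stated at the top of the notebook when deciding whether to reject the null hypothesis.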

In [None]:
#type/paste your code here
#you can use "insert-insert cell below" to add additional cells instead of running all at once


*Report the results of your null hypothesis test formally. What test did you use and what was the significance (p) value? Did you reject or fail to reject the null hypothesis?*

**Prompt 1 Null Hypothesis Findings:**

type here

*Interpret the results of your null hypothesis test in words. What do your findings tell us about these genres (at least in the Million Song Data Subset)?*

type here

*If you rejected the null hypothesis: describe one possible mechanism that might drive this difference (e.g. why would the distribution in the data across decades differ by genre).*

*If you failed to reject the null hypothesis: describe one possible reason for the null finding (e.g. why wouldn't the distribution in the data across decades differ by genre).*

*In either case, your explanation may include any combination of substantive, epistemological, or measurement issues.*

**Prompt 1 Discussion of Findings:**

type here

## Prompt 2: Artist and Musical Characteristics

***Investigate whether a musical characteristic (loudness, tempo, or duration) varies between two artists of your choice.***

In [None]:
import pandas as pd

In [None]:
#Download data for artists with 10 or more songs in the subset
indata2 = pd.read_csv('https://raw.githubusercontent.com/ndporter/pythonDiSC/master/finalProject/DiSC_Million_Songs_Artists.csv')

In [None]:
#remove missing data on all variables
cols = ['loudness','tempo','duration'] #named cols to avoid shadowing Python's built-in vars()
indata2 = indata2.dropna(subset=cols) #defined NaN
for col in cols:
    indata2 = indata2[indata2[col]!=0] #missing set to zero

In [None]:
#Display all artists to choose from
indata2['artist_name'].groupby(indata2['artist_name']).count()

*Choose two artists to investigate and create a hypothesis about them related to the prompt above. Include a rationale justifying your hypothesis in terms of the data source and/or the artists themselves and write both the hypothesis and justification (in words) in the cell below.*

**Prompt 2 Hypothesis:**

type here

*Formally state the null (H0) hypotheses in the cell below.*

**Prompt 2 Null Hypotheses:**

type here

*Test your hypotheses using the appropriate statistical test. You do not need to worry about missing data here.*
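
If a two-sample t-test suits your hypothesis, the sketch below shows one possible form using `scipy.stats.ttest_ind`. The artist names and tempo values are invented (`toy` is hypothetical), and `equal_var=False` (Welch's version) is an assumption you may or may not want for your own data.

```python
import pandas as pd
from scipy.stats import ttest_ind

# Made-up tempos for two placeholder artists; NOT drawn from the course data
toy = pd.DataFrame({
    'artist_name': ['Artist A'] * 6 + ['Artist B'] * 6,
    'tempo': [118, 122, 125, 120, 119, 124, 140, 138, 145, 150, 142, 139],
})

# Split the characteristic of interest by artist
a = toy.loc[toy['artist_name'] == 'Artist A', 'tempo']
b = toy.loc[toy['artist_name'] == 'Artist B', 'tempo']

# Welch's t-test: does not assume the two artists have equal variances
t_stat, p_value = ttest_ind(a, b, equal_var=False)
print('t =', t_stat, 'p =', p_value)
```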

In [None]:
#type/paste your code here
#you can use "insert-insert cell below" to add additional cells instead of running all at once


*Report the results of your null hypothesis test formally. What test did you use and what was the significance (p) value? Did you reject or fail to reject the null hypothesis?*

**Prompt 2 Null Hypothesis Findings:**

type here

*Interpret the results of your null hypothesis test in words. What do your findings tell us about these artists (at least in the Million Song Data Subset)?*

type here

*If you rejected the null hypothesis: describe one possible mechanism that might drive this difference (e.g. why would the variable's values differ between the two artists).*

*If you failed to reject the null hypothesis: describe one possible reason for the null finding (e.g. why wouldn't the variable's values differ between the two artists).*

*In either case, your explanation may include any combination of substantive, epistemological, or measurement issues.*

**Prompt 2 Discussion of Findings:**

type here

## Prompt 3: Year and Hotttnesss/Familiarity

***There has been some speculation in this class that the hotttnesss and familiarity measurements in the dataset might be biased toward music that was newer when the dataset was collected (2010). Investigate the relationship between year (not recoded) and these three variables to test whether there is reason to further investigate the potential for such bias.***

In [None]:
import pandas as pd

In [None]:
#Download data for the full subset
indata3 = pd.read_csv('https://raw.githubusercontent.com/ndporter/pythonDiSC/master/finalProject/DiSC_Million_Songs_Data.csv')

In [None]:
indata3.head()

In [None]:
#remove missing data on all variables
cols = ['year','song_hotttness','artist_hotttness','familiarity'] #named cols to avoid shadowing Python's built-in vars()
indata3 = indata3.dropna(subset=cols) #defined NaN
for col in cols:
    indata3 = indata3[indata3[col]!=0] #missing set to zero
indata3 = indata3[cols] #drop extra variables

*Choose one variable (song_hotttnesss, artist_hotttnesss or artist_familiarity) to investigate and create a hypothesis about its relationship to year based on the prompt above. Include a rationale justifying your hypothesis in terms of the data source and/or the artists themselves and write both the hypothesis and justification (in words) in the cell below.*

**Prompt 3 Hypothesis:**

type here

*Formally state the null (H0) hypotheses in the cell below.*

**Prompt 3 Null Hypotheses:**

type here

*Create a correlation matrix of all variables (note that year is in calendar years, not decades as in Prompt 1).*
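
As a reminder of the mechanics, `DataFrame.corr()` returns pairwise Pearson correlations by default. The sketch below uses made-up values (`toy` is hypothetical); the column names are assumed to match those in the data-cleaning cell above.

```python
import pandas as pd

# Made-up values for illustration only; NOT drawn from the course data
toy = pd.DataFrame({
    'year': [1985, 1992, 1999, 2004, 2008],
    'song_hotttness': [0.30, 0.42, 0.38, 0.55, 0.61],
    'artist_hotttness': [0.35, 0.33, 0.45, 0.50, 0.48],
    'familiarity': [0.50, 0.52, 0.60, 0.63, 0.70],
})

# Pairwise Pearson correlations between every pair of columns
corr = toy.corr()
print(corr)
```

Each cell holds the correlation between its row and column variables, so the diagonal is always 1 and the matrix is symmetric.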

In [None]:
#type/paste your code here
#you can use "insert-insert cell below" to add additional cells instead of running all at once

*Interpret the sign and relative sizes of the correlations of year with all three other variables. You do not need to discuss significance yet.* 

type here

*Test your hypothesis using the appropriate statistical test. You do not need to worry about missing data here.*
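
If you choose to test a Pearson correlation, `scipy.stats.pearsonr` returns both the coefficient and a two-sided p-value. This is a sketch on invented numbers (`toy` is hypothetical), with `familiarity` picked arbitrarily as the example variable.

```python
import pandas as pd
from scipy.stats import pearsonr

# Made-up values for illustration only; NOT drawn from the course data
toy = pd.DataFrame({
    'year': [1985, 1990, 1995, 2000, 2005, 2008],
    'familiarity': [0.45, 0.50, 0.48, 0.60, 0.62, 0.70],
})

# r is the correlation coefficient; p tests the null hypothesis of no linear association
r, p = pearsonr(toy['year'], toy['familiarity'])
print('r =', r, 'p =', p)
```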

In [None]:
#type/paste your code here
#you can use "insert-insert cell below" to add additional cells instead of running all at once

*Report the results of your null hypothesis test formally. What test did you use and what was the significance (p) value? Did you reject or fail to reject the null hypothesis?*

**Prompt 3 Null Hypothesis Findings:**

type here

*Interpret the results of your null hypothesis test in words. What do your findings tell us about the variables and/or data?*

type here

*If you rejected the null hypothesis: describe one possible mechanism that might drive this difference (e.g. why would the variable's values in the data differ across years).*

*If you failed to reject the null hypothesis: describe one possible reason for the null finding (e.g. why wouldn't the variable's values in the data differ across years).*

*In either case, your explanation may include any combination of substantive, epistemological, or measurement issues.*

**Prompt 3 Discussion of Findings:**

type here