## **Lab 03 - EDA with Statistical Testing**

Paige Rosynek, 01.05.2023

### **Introduction**

In this lab we explore several statistical tests to identify correlated and predictive variables. We first work through a simple problem to review the two-sample t-test and to practice interpreting the results of statistical tests. We then research Pearson's correlation, the Kruskal-Wallis test, and the chi-squared test of independence before applying each to the cleaned Sacramento real estate data set. Finally, we compare the results of these tests with our analysis from Lab 2 to check which variables are predictive of property type and price.

**Import libraries**

In [1]:
from scipy import stats
import pandas as pd

### **Part I: Review of Statistical Tests**

**Problem:**

Let's say that you decide you want to know if playing video games impacts students' 
grades.  You set up a survey which asks students two questions: 
 
1. Do you play video games regularly?  Yes / no 
2. What is your GPA? 
 


**Hypothesis:** If a student plays video games regularly, then they will have an average GPA of 2.9 or lower.

**Results:**

You now decide to look at the survey results.  You have 100 responses!  68 students said 
they play video games regularly, while 32 students said they did not. The 68 gamers
have an average GPA of 3.4 with a standard deviation of 1.2, while the 32 non-gamers 
have an average GPA of 3.3 with a standard deviation of 1.1. 

**Two-sample t-test**

A two-sample t-test is used to explore the relationship between a measurement variable and a categorical variable that has exactly two categories. It tests whether the mean of the measurement variable differs between the two groups. The problem above fits these requirements: there is one measurement variable (GPA) and one categorical variable (whether or not a student plays video games regularly) with exactly two categories (yes and no). One assumption of the two-sample t-test is that the observations within each group are normally distributed, which may not hold for the problem described above.

**Null hypothesis:** The mean GPA of students who play video games regularly is equal to the mean GPA of students who do not.

**Alternative hypothesis:** The mean GPA of students who play video games regularly is different from the mean GPA of students who do not.

**Perform two-sample t-test on the problem data**

In [2]:
statistic, p_value = stats.ttest_ind_from_stats(mean1=3.4, std1=1.2, nobs1=68, mean2=3.3, std2=1.1, nobs2=32)
print(f'statistic = {statistic}\tp_value = {p_value}')

statistic = 0.39893881176878243	p_value = 0.6908062583072547


**Result:** we fail to reject the null hypothesis; there is not sufficient evidence to support the alternative hypothesis.

After performing the two-sample t-test on the problem data, the calculated p-value is approximately $0.691$. Using a significance threshold of $0.01$, the result is not statistically significant because the p-value is much greater than the threshold. In terms of the problem, the difference between the mean GPAs of the two groups is not statistically significant. Therefore, we fail to reject the null hypothesis that the mean GPAs of the two groups are equal; this is not the same as proving that the means are equal. These results do not support my original hypothesis that students who play video games regularly have an average GPA of 2.9 or lower: the gamers in the sample had an average GPA of 3.4, and the test found no significant difference from the non-gamers' average.
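
As a sanity check, the statistic reported above can be reproduced by hand. The sketch below assumes the pooled-variance (equal variance) form of the test, which is the default behavior of `stats.ttest_ind_from_stats`; the variable names are illustrative.

```python
import math

# Summary statistics from the survey results above.
mean1, std1, n1 = 3.4, 1.2, 68   # students who play video games regularly
mean2, std2, n2 = 3.3, 1.1, 32   # students who do not

# Pooled variance: a weighted average of the two sample variances.
dof = n1 + n2 - 2
pooled_var = ((n1 - 1) * std1 ** 2 + (n2 - 1) * std2 ** 2) / dof

# Standard error of the difference between the two sample means.
std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))

# t statistic: difference in means measured in standard errors.
t_stat = (mean1 - mean2) / std_err
print(f't = {t_stat:.6f}, degrees of freedom = {dof}')
```

The printed statistic should agree with the value returned by SciPy above. A difference of well under one standard error is why the p-value is so large.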

### **Part II: Exploring Additional Statistical Tests**

**Pearson's Correlation**

- **When to use:** 
    - Two numerical variables.
    <br><br>
- **Assumptions:** 
    - The data has a linear relationship or structure.
    - Samples are independent of each other. 
    - Homoscedasticity (there is a similar spread across the range). The variance around the regression line is the same for all values of the predictor variable. 
    - Both variables are normally distributed.
    <br><br>
- **Null hypothesis:**
    - The slope of the best fit line is equal to 0 (there's no correlation between the two variables).
    <br><br>
- **Alternative hypothesis**
    - The slope of the best fit line is not equal to 0 (there is a correlation between the two variables).
    <br><br>
- **Statistical significance**
    - If the test indicates statistical significance, then we reject the null hypothesis and conclude that the two variables are correlated. If the test does not indicate statistical significance, then we do not have sufficient evidence to conclude that the two variables are correlated.
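
To make the coefficient itself concrete, here is a minimal sketch that computes Pearson's r directly from its definition. The function name `pearson_r` and the toy data are hypothetical and are not taken from the real estate data set.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(f'r = {r:.4f}')
```

Note that the coefficient is symmetric: swapping `x` and `y` gives the same value, even though the fitted slope of a regression line would change.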


**Kruskal-Wallis Test**

- **When to use:**
    - One categorical and one numerical variable (converted to ranks with rank of 1 being the smallest measurement).
    <br><br>
- **Assumptions:**
    - Does **not** assume data is normally distributed.
    - Observations in each group come from similarly distributed data.
    - Observations within each group are independent.
    <br><br>
- **Null hypothesis:**
    - The mean ranks of the groups are equal.
    <br><br>
- **Alternative hypothesis**
    - The mean rank of at least one of the groups is different from that of the other groups.
    <br><br>
- **Statistical significance**
    - If the test indicates statistical significance, then we reject the null hypothesis and conclude that the mean rank of at least one group is different. If the test does not indicate statistical significance, then we do not have sufficient evidence to conclude that the mean ranks of the groups differ.
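
The rank-based idea can be sketched in a few lines. The code below computes the H statistic from pooled ranks, assuming the textbook formula without the tie correction factor (SciPy's `stats.kruskal` does apply a tie correction, so values can differ when the data contains ties). The helper names and toy groups are hypothetical.

```python
def average_ranks(values):
    """Rank values from smallest (rank 1) to largest, averaging the ranks of ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1  # average of the 1-based positions in the run
        for position in range(start, end + 1):
            ranks[order[position]] = tied_rank
        start = end + 1
    return ranks

def kruskal_h(groups):
    """Kruskal-Wallis H statistic (no tie correction) for a list of groups."""
    pooled = [value for group in groups for value in group]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    weighted_sum = 0.0
    offset = 0
    for group in groups:
        rank_sum = sum(ranks[offset:offset + len(group)])
        weighted_sum += rank_sum ** 2 / len(group)
        offset += len(group)
    return 12 / (n_total * (n_total + 1)) * weighted_sum - 3 * (n_total + 1)

# Hypothetical toy groups: fully separated, so H is large for this sample size.
h_separated = kruskal_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# Interleaved groups with equal rank sums, so H is zero.
h_mixed = kruskal_h([[1, 4], [2, 3]])
print(f'separated: H = {h_separated:.2f}, mixed: H = {h_mixed:.2f}')
```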

**Chi-Square Test**

- **When to use:**
    - Two categorical variables.
    <br><br>
- **Assumptions:**
    - Observations are independent.
    - The sample size is large.
    - No cell in the table should have an expected count of less than one.
    - No more than 20% of the cells should have an expected count of less than five.
    <br><br>
- **Null hypothesis:**
    - The categorical variables are independent (there is no association between the two variables).
    <br><br>
- **Alternative hypothesis**
    - The two categorical variables are dependent (there is an association between the two variables).
    <br><br>
- **Statistical significance**
    - If the test indicates statistical significance, then we reject the null hypothesis and conclude that the two categorical variables are dependent (associated). If the test does not indicate statistical significance, then we do not have sufficient evidence to conclude that the categorical variables are associated.
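
The expected-count assumptions listed above can be checked directly from a contingency table. This is a minimal sketch with hypothetical helper names and a made-up table; it computes the plain chi-square statistic without Yates' continuity correction, which `stats.chi2_contingency` applies by default to 2x2 tables, so the two values can differ in that case.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

def chi_square_statistic(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    expected = expected_counts(table)
    return sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )

def meets_expected_count_rules(table):
    """No expected count below 1, and at most 20% of cells with expected count below 5."""
    cells = [exp for row in expected_counts(table) for exp in row]
    none_below_one = all(exp >= 1 for exp in cells)
    share_below_five = sum(exp < 5 for exp in cells) / len(cells)
    return none_below_one and share_below_five <= 0.20

# Hypothetical 2x2 table of observed counts.
observed = [[10, 20], [30, 40]]
print(expected_counts(observed))
print(f'chi2 = {chi_square_statistic(observed):.4f}')
print(meets_expected_count_rules(observed))
```

Running a check like `meets_expected_count_rules` on a sparse crosstab (for example, one with many rare categories) is a quick way to see whether the chi-square p-value should be trusted.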

### **Part III: Regression on Price**

Load the cleaned Sacramento real estate dataset

In [3]:
df = pd.read_csv('../data/sacramento_real_estate_clean.csv')

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 984 entries, 0 to 983
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   street       984 non-null    object 
 1   city         984 non-null    object 
 2   zip          984 non-null    int64  
 3   state        984 non-null    object 
 4   beds         984 non-null    int64  
 5   baths        984 non-null    int64  
 6   sq__ft       984 non-null    int64  
 7   type         984 non-null    object 
 8   sale_date    984 non-null    object 
 9   price        984 non-null    int64  
 10  latitude     984 non-null    float64
 11  longitude    984 non-null    float64
 12  empty_lot    984 non-null    bool   
 13  street_type  984 non-null    object 
dtypes: bool(1), float64(2), int64(5), object(6)
memory usage: 101.0+ KB


##### *Fit a linear regression model and compute Pearson's correlation coefficient for each continuous variable against the property price. Then determine statistical significance using $\alpha = 0.01$*

**Square footage**

In [5]:
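# linregress(x, y) regresses y on x, so price is passed as the response; r and p are unchanged by the argument order
slope, intercept, r_value, p_value, stderr = stats.linregress(df['sq__ft'], df['price'])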
print(f'pearson coefficient = {r_value}\tp_value = {p_value}')

pearson coefficient = 0.3347796424579801	p_value = 3.386589667562419e-27


**Latitude**

In [6]:
# price is the response variable; r and p are unchanged by the argument order
slope, intercept, r_value, p_value, stderr = stats.linregress(df['latitude'], df['price'])
print(f'pearson coefficient = {r_value}\tp_value = {p_value}')

pearson coefficient = -0.03965005291124815	p_value = 0.21398529101172617


**Longitude**

In [7]:
# price is the response variable; r and p are unchanged by the argument order
slope, intercept, r_value, p_value, stderr = stats.linregress(df['longitude'], df['price'])
print(f'pearson coefficient = {r_value}\tp_value = {p_value}')

pearson coefficient = 0.28514597550819737	p_value = 7.281822955898328e-20


| **Variable** | **p-value** | **Statistically Significant? $p < \alpha = 0.01$** |
|----------|----------|----------|
| sq__ft   | 3.386589667562419e-27   | Yes   |
| latitude   | 0.21398529101172617   | No   |
| longitude   | 7.281822955898328e-20   | Yes   |


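The tests for sq__ft and longitude produced p-values less than $\alpha = 0.01$, so their linear relationships with price are statistically significant and both variables are predictive of price. Latitude, with a p-value of about $0.214$, is not statistically significant.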
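
The significance decision in the table is a simple threshold comparison. As a small sketch, the p-values printed above can be collected in a dictionary (the names `ALPHA`, `p_values`, and `significant` are illustrative) and compared against $\alpha$ in one pass, which avoids copying results into the table by hand.

```python
ALPHA = 0.01  # significance threshold used throughout this lab

# p-values copied from the linear regression outputs above.
p_values = {
    'sq__ft': 3.386589667562419e-27,
    'latitude': 0.21398529101172617,
    'longitude': 7.281822955898328e-20,
}

# A variable is flagged as significant when its p-value is below the threshold.
significant = {name: p < ALPHA for name, p in p_values.items()}

for name, p in p_values.items():
    label = 'Yes' if significant[name] else 'No'
    print(f'{name}\t{p:.3e}\t{label}')
```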

##### *Perform the Kruskal-Wallis test on each categorical variable versus price.*

**Note:** groups with fewer than 5 samples should be removed before running the test; the cells below do not apply this filter.
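
One way to drop small groups is to filter the list of samples before it is passed to the test. The helper below is a hypothetical sketch (the name `drop_small_groups` and the toy data are not part of the original notebook); it works on any list of sequences that support `len`, including pandas Series.

```python
def drop_small_groups(samples_by_group, min_size=5):
    """Keep only the groups that contain at least min_size observations."""
    return [group for group in samples_by_group if len(group) >= min_size]

# Hypothetical toy groups: the middle group has only 2 observations.
groups = [
    [100, 120, 130, 150, 170],
    [90, 95],
    [200, 210, 220, 230, 240, 250],
]
kept = drop_small_groups(groups)
print([len(group) for group in kept])
```

In the cells below, the filtered list could be passed on as `stats.kruskal(*drop_small_groups(samples_by_group))`. Doing so would change the statistics and p-values from those reported in this notebook.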

**Property type**

In [8]:
samples_by_group = [] 
for value in set(df['type']): 
    mask = df['type'] == value 
    samples_by_group.append(df['price'][mask])

stat, p_value = stats.kruskal(*samples_by_group)
print(f'stat = {stat}\tp_value = {p_value}')

stat = 29.905280806258325	p_value = 3.207382704900952e-07


**City**

In [9]:
samples_by_group = [] 
for value in set(df['city']): 
    mask = df['city'] == value 
    samples_by_group.append(df['price'][mask])

stat, p_value = stats.kruskal(*samples_by_group)
print(f'stat = {stat}\tp_value = {p_value}')

stat = 332.3165420913258	p_value = 3.713804860597625e-49


**Beds**

In [10]:
samples_by_group = [] 
for value in set(df['beds']): 
    mask = df['beds'] == value 
    samples_by_group.append(df['price'][mask])

stat, p_value = stats.kruskal(*samples_by_group)
print(f'stat = {stat}\tp_value = {p_value}')

stat = 191.35629715694265	p_value = 7.750684335831685e-38


**Baths**

In [11]:
samples_by_group = [] 
for value in set(df['baths']): 
    mask = df['baths'] == value 
    samples_by_group.append(df['price'][mask])

stat, p_value = stats.kruskal(*samples_by_group)
print(f'stat = {stat}\tp_value = {p_value}')

stat = 244.20693200117202	p_value = 9.614316922056859e-51


**Street type**

In [12]:
samples_by_group = [] 
for value in set(df['street_type']): 
    mask = df['street_type'] == value 
    samples_by_group.append(df['price'][mask])

stat, p_value = stats.kruskal(*samples_by_group)
print(f'stat = {stat}\tp_value = {p_value}')

stat = 119.78778104892767	p_value = 7.810255154460825e-16


**Empty lot**

In [13]:
samples_by_group = [] 
for value in set(df['empty_lot']): 
    mask = df['empty_lot'] == value 
    samples_by_group.append(df['price'][mask])

stat, p_value = stats.kruskal(*samples_by_group)
print(f'stat = {stat}\tp_value = {p_value}')

stat = 4.106841729338111	p_value = 0.04271004986704472


**Zip**

In [14]:
samples_by_group = [] 
for value in set(df['zip']): 
    mask = df['zip'] == value 
    samples_by_group.append(df['price'][mask])

stat, p_value = stats.kruskal(*samples_by_group)
print(f'stat = {stat}\tp_value = {p_value}')

stat = 488.3199514784271	p_value = 2.792740902787263e-65


Note: the state variable does not meet the requirements of the Kruskal-Wallis test because there is only one group or category for all the observations.

| **Variable** | **p-value** | **Statistically Significant? $p < \alpha = 0.01$** |
|----------|----------|----------|
| type   | 3.207382704900952e-07   | Yes   |
| city   | 3.713804860597625e-49   | Yes  |
| beds | 7.750684335831685e-38   | Yes   |
| baths | 9.614316922056859e-51   | Yes   |
| street_type | 7.810255154460825e-16   | Yes   |
| empty_lot | 0.04271004986704472   | No   |
| zip | 2.792740902787263e-65   | Yes   |

Overall, the results of the statistical tests agree with my analysis of the visualizations from Lab 2, with one exception. In Lab 2 I concluded that longitude was not predictive of property price, but according to the linear regression above, the relationship between price and longitude is statistically significant. Longitude therefore appears to be predictive of price, which contradicts my original conclusion.

### **Part IV: Classification on Property Type**

##### *Run Kruskal-Wallis test for each continuous variable versus the property type.*

**Price**

In [15]:
samples_by_group = [] 
for value in set(df['type']): 
    mask = df['type'] == value 
    samples_by_group.append(df['price'][mask])

stat, p_value = stats.kruskal(*samples_by_group)
print(f'stat = {stat}\tp_value = {p_value}')

stat = 29.905280806258325	p_value = 3.207382704900952e-07


**Latitude**

In [16]:
samples_by_group = [] 
for value in set(df['type']): 
    mask = df['type'] == value 
    samples_by_group.append(df['latitude'][mask])

stat, p_value = stats.kruskal(*samples_by_group)
print(f'stat = {stat}\tp_value = {p_value}')

stat = 2.3667464487727994	p_value = 0.30624396471011495


**Longitude**

In [17]:
samples_by_group = [] 
for value in set(df['type']): 
    mask = df['type'] == value 
    samples_by_group.append(df['longitude'][mask])

stat, p_value = stats.kruskal(*samples_by_group)
print(f'stat = {stat}\tp_value = {p_value}')

stat = 0.44143160518918306	p_value = 0.8019445584702664


**Square footage**

In [18]:
samples_by_group = [] 
for value in set(df['type']): 
    mask = df['type'] == value 
    samples_by_group.append(df['sq__ft'][mask])

stat, p_value = stats.kruskal(*samples_by_group)
print(f'stat = {stat}\tp_value = {p_value}')

stat = 54.20795217988388	p_value = 1.6939194179245939e-12


| **Variable** | **p-value** | **Statistically Significant? $p < \alpha = 0.01$** |
|----------|----------|----------|
| price   | 3.207382704900952e-07   | Yes   |
| latitude   | 0.30624396471011495   | No   |
| longitude   | 0.8019445584702664   | No   |
| sq__ft   | 1.6939194179245939e-12   | Yes   |

##### *Run the Chi-Square test of independence between each categorical variable and the property type*

**Street type**

In [19]:
combination_counts = pd.crosstab(df['street_type'], df['type']) 
chi2, p_value, _, _ = stats.chi2_contingency(combination_counts)
print(f'chi2 = {chi2}\tp_value = {p_value}')

chi2 = 164.1111276102213	p_value = 2.392898504927e-16


**City**

In [20]:
combination_counts = pd.crosstab(df['city'], df['type']) 
chi2, p_value, _, _ = stats.chi2_contingency(combination_counts)
print(f'chi2 = {chi2}\tp_value = {p_value}')

chi2 = 52.99924802253293	p_value = 0.9690599778204446


**Beds**

In [21]:
combination_counts = pd.crosstab(df['beds'], df['type']) 
chi2, p_value, _, _ = stats.chi2_contingency(combination_counts)
print(f'chi2 = {chi2}\tp_value = {p_value}')

chi2 = 356.4586026562881	p_value = 1.816876113747052e-67


**Baths**

In [22]:
combination_counts = pd.crosstab(df['baths'], df['type']) 
chi2, p_value, _, _ = stats.chi2_contingency(combination_counts)
print(f'chi2 = {chi2}\tp_value = {p_value}')

chi2 = 225.83626587573647	p_value = 6.406801519550199e-43


**Empty lot**

In [23]:
combination_counts = pd.crosstab(df['empty_lot'], df['type']) 
chi2, p_value, _, _ = stats.chi2_contingency(combination_counts)
print(f'chi2 = {chi2}\tp_value = {p_value}')

chi2 = 3.6406022453620115	p_value = 0.16197696865046218


**Zip**

In [24]:
combination_counts = pd.crosstab(df['zip'], df['type']) 
chi2, p_value, _, _ = stats.chi2_contingency(combination_counts)
print(f'chi2 = {chi2}\tp_value = {p_value}')

chi2 = 203.22283050088117	p_value = 0.00010690898007077515


**State**

In [25]:
combination_counts = pd.crosstab(df['state'], df['type']) 
chi2, p_value, _, _ = stats.chi2_contingency(combination_counts)
print(f'chi2 = {chi2}\tp_value = {p_value}')

chi2 = 0.0	p_value = 1.0


| **Variable** | **p-value** | **Statistically Significant? $p < \alpha = 0.01$** |
|----------|----------|----------|
| street_type   | 2.392898504927e-16   | Yes   |
| city   | 0.9690599778204446   | No   |
| beds   | 1.816876113747052e-67   | Yes   |
| baths   | 6.406801519550199e-43   | Yes   |
| empty_lot   | 0.16197696865046218   | No   |
| zip   | 0.00010690898007077515   | Yes   |
| state   | 1.0   | No   |

In Lab 2, I drew incorrect conclusions about street_type and city. According to the chi-squared tests above, street_type is associated with property type, whereas in Lab 2 I concluded that it was not predictive. Conversely, the test does not show a statistically significant association between city and property type, whereas in Lab 2 I concluded that city was predictive. For the remaining variables, the results of the statistical tests agree with my conclusions from Lab 2.

### **Conclusion**

One of the biggest takeaways from this lab is that visualizing data, as we did in Lab 2, lets you form reasonable hypotheses about which variables are correlated or predictive, while running statistical tests on the data, as we did above, lets you confirm or reject those hypotheses. I found that I made incorrect assumptions about some of the variables in Lab 2 because I was unable to interpret the relationships accurately from the visualizations alone. The statistical tests quantified those relationships, which made the relationships between the variables much clearer.