<h1>Case Study Eric Bühler - Aachen</h1>

This notebook performs some initial data analysis and then estimates the regressions.

In [4]:
# Packages used in this notebook
import requests
from nltk.sentiment import SentimentIntensityAnalyzer
from bs4 import BeautifulSoup
import collections
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
import re
import nltk
import numpy as np
import pandas as pd
import configparser
import json

<h3>Analysis of Verbindung and Verbindung möglich variables</h3>

To avoid confusion about the fraternity variable: I have included two ways of deciding whether a listing is a fraternity. One checks the provided texts for synonyms of fraternity; the other, since not all fraternities identify themselves as such, looks for listings with low rent and a high number of roommates. To avoid collinearity, however, only one of the two should be included in the regression. To check that we are not losing a large amount of information by doing so, the following code checks the overlap of the two variables.

In [5]:
# Load the CSV file into a DataFrame
file_path = '/workspaces/fdap-2024-Big-Eric-Blip/casestudy/student_housing/data_analysis/anzeigen.csv'  # Path to the CSV file
anzeigen = pd.read_csv(file_path)

# Condition checks
condition_one = (anzeigen['verbindung'] == True) & (anzeigen['verbindung_moeglich'] == False)
condition_two = (anzeigen['verbindung'] == False) & (anzeigen['verbindung_moeglich'] == True)

# Check condition one
rows_with_condition_one = anzeigen[condition_one]
if not rows_with_condition_one.empty:
    print("There is at least one row where 'verbindung' is True and 'verbindung_moeglich' is False.")
else:
    print("There are no rows where 'verbindung' is True and 'verbindung_moeglich' is False.")

num_rows_one = condition_one.sum()
print(f"There are {num_rows_one} rows where 'verbindung' is True and 'verbindung_moeglich' is False.")

# Check condition two
rows_with_condition_two = anzeigen[condition_two]
if not rows_with_condition_two.empty:
    print("There is at least one row where 'verbindung' is False and 'verbindung_moeglich' is True.")
else:
    print("There are no rows where 'verbindung' is False and 'verbindung_moeglich' is True.")

num_rows_two = condition_two.sum()
print(f"There are {num_rows_two} rows where 'verbindung' is False and 'verbindung_moeglich' is True.")


There is at least one row where 'verbindung' is True and 'verbindung_moeglich' is False.
There are 2 rows where 'verbindung' is True and 'verbindung_moeglich' is False.
There is at least one row where 'verbindung' is False and 'verbindung_moeglich' is True.
There are 19 rows where 'verbindung' is False and 'verbindung_moeglich' is True.
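
The same overlap can also be read off a single contingency table. The following is a minimal sketch, assuming both columns are boolean; the toy DataFrame and the label strings are invented for illustration and are not the listing data.

```python
import pandas as pd

# Hypothetical toy data standing in for anzeigen.csv (NOT the real listings)
toy = pd.DataFrame({
    "verbindung":          [True, True, False, False, False, True],
    "verbindung_moeglich": [True, False, True, False, False, True],
})

# Map the booleans to readable labels so the table is easy to index
flag_a = toy["verbindung"].map({True: "verbindung: yes", False: "verbindung: no"})
flag_b = toy["verbindung_moeglich"].map({True: "moeglich: yes", False: "moeglich: no"})

# Rows: first flag, columns: second flag
overlap = pd.crosstab(flag_a, flag_b)
print(overlap)

# The off-diagonal cells are the listings on which the two flags disagree
only_verbindung = overlap.loc["verbindung: yes", "moeglich: no"]
only_moeglich = overlap.loc["verbindung: no", "moeglich: yes"]
print(only_verbindung, only_moeglich)
```

On the real data, the two off-diagonal cells would correspond to the two counts printed above.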


<h3>Evaluating NDVI of the districts</h3>

In this cell, we load the mean NDVI per postcode (PLZ) computed in the [urban_green_spaces.ipynb](/workspaces/fdap-2024-Big-Eric-Blip/casestudy/student_housing/google_earth_engine/urban_green_spaces.ipynb) notebook and sort the districts by descending NDVI. Districts with a lot of vegetation will later be compared with districts with high rents.

In [6]:
import pandas as pd

# Load the CSV data into a DataFrame
df = pd.read_csv('/workspaces/fdap-2024-Big-Eric-Blip/casestudy/student_housing/google_earth_engine/ndvi_2020_results.csv')

# Sort the DataFrame by 'mean_ndvi' in descending order
sorted_df = df.sort_values(by='mean_ndvi', ascending=False)

# Display the sorted DataFrame
print("\nSorted DataFrame by Decreasing mean NDVI:")
display(sorted_df)

# Optionally, save the sorted DataFrame to a new CSV file
sorted_df.to_csv('sorted_ndvi_2020_results.csv', index=False)



Sorted DataFrame by Decreasing mean NDVI:


Unnamed: 0,plz_code,mean_ndvi
6,52076,0.283134
1,52074,0.282947
9,52080,0.23988
3,52078,0.230341
4,52072,0.229945
0,52066,0.207245
2,52070,0.200926
8,52068,0.145891
7,52064,0.135038
5,52062,0.099158
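
One possible way to compare green districts with expensive ones is a rank correlation between mean NDVI and the median rent per postcode. The sketch below uses the NDVI values from the table above; the `listings` frame and its `rent_per_sqm` column are invented purely for illustration and do not come from anzeigen.csv.

```python
import pandas as pd

# Mean NDVI per postcode, copied from the table above
ndvi = pd.DataFrame({
    "plz_code": [52076, 52074, 52080, 52078, 52072,
                 52066, 52070, 52068, 52064, 52062],
    "mean_ndvi": [0.283134, 0.282947, 0.23988, 0.230341, 0.229945,
                  0.207245, 0.200926, 0.145891, 0.135038, 0.099158],
})

# Hypothetical listings (invented values, NOT the real data)
listings = pd.DataFrame({
    "plz": [52062, 52062, 52064, 52074, 52074, 52076],
    "rent_per_sqm": [24.0, 26.0, 22.0, 18.0, 20.0, 17.0],
})

# Median rent per postcode, renamed so it can be joined on plz_code
median_rent = (
    listings.groupby("plz", as_index=False)["rent_per_sqm"].median()
    .rename(columns={"plz": "plz_code"})
)

# Keep only postcodes that appear in both tables
merged = ndvi.merge(median_rent, on="plz_code", how="inner")

# Spearman rank correlation, computed as the Pearson correlation of the ranks
rho = merged["mean_ndvi"].rank().corr(merged["rent_per_sqm"].rank())
print(merged)
print(rho)
```

With only ten postcodes, such a coefficient should be read as a rough indication rather than as a test result.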


<h2>Regressions</h2>

<h3>Analysis of rent</h3>

In the following code, we regress the rent per square meter (`miete` divided by `groesse`) on the independent variables bewohner and plz, where plz enters as a set of dummy variables.

In [2]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import re

# Define the necessary columns for the DataFrame
necessary_columns = ['bewohner', 'groesse', 'miete', 'plz']

# Load the CSV file into a DataFrame
file_path = '/workspaces/fdap-2024-Big-Eric-Blip/casestudy/student_housing/data_analysis/anzeigen.csv'  # Path to the CSV file
df = pd.read_csv(file_path)

# Create a DataFrame with only the necessary columns
df = df[necessary_columns]

# Convert numeric columns to numeric types and handle errors
df['bewohner'] = pd.to_numeric(df['bewohner'], errors='coerce')
df['groesse'] = pd.to_numeric(df['groesse'], errors='coerce')
df['miete'] = pd.to_numeric(df['miete'], errors='coerce')

# Convert the rent to rent per square meter
df['miete'] = df['miete'] / df['groesse']

# Drop rows with missing values in numeric columns
df = df.dropna(subset=['bewohner',  'miete'])

# Convert 'plz' to categorical and create dummy variables
df['plz'] = df['plz'].astype('category')
df = pd.get_dummies(df, columns=['plz'], drop_first=True)

# Add the intercept (constant term) for the regression model
df = sm.add_constant(df)

# Round numeric values to a specified number of decimal places
df = df.round({'bewohner': 0, 'miete': 2})

# Prepare the data for regression
X = df.drop(columns=['miete','groesse'])  # Exclude the dependent variable 'miete' and 'groesse', which is already used to normalise the rent
y = df['miete']

# Ensure all columns used for regression are numeric
X = X.apply(pd.to_numeric, errors='coerce')  # Convert to numeric, forcing non-numeric to NaN
X = X.fillna(0)  # Fill NaNs with 0 or some other value depending on context

# Display only the column names
display(X.columns.tolist())

# Convert to float NumPy arrays
X = np.asarray(X, dtype=float)
y = np.asarray(y, dtype=float)

# Fit the model
model = sm.OLS(y, X).fit()

# Print the summary of the regression
print("Regression Summary:")
print(model.summary())  # Coefficients table





['const',
 'bewohner',
 'plz_52064',
 'plz_52066',
 'plz_52068',
 'plz_52070',
 'plz_52072',
 'plz_52074',
 'plz_52078',
 'plz_52080',
 'plz_52159',
 'plz_52249']

Regression Summary:
                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.242
Model:                            OLS   Adj. R-squared:                  0.206
Method:                 Least Squares   F-statistic:                     6.700
Date:                Mon, 08 Jul 2024   Prob (F-statistic):           9.98e-10
Time:                        14:58:57   Log-Likelihood:                -830.21
No. Observations:                 243   AIC:                             1684.
Df Residuals:                     231   BIC:                             1726.
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         32.1251      1.267
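
Because the postcode dummies are created with `drop_first=True`, one postcode serves as the baseline (presumably 52062, which is missing from the column list above), and every `plz_*` coefficient is a difference relative to it. The sketch below illustrates this on invented toy data with plain NumPy least squares; it leaves out `bewohner`, so the intercept is simply the baseline mean, which is not the case in the full model.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (invented, NOT the real listings)
toy = pd.DataFrame({
    "plz": [52062, 52062, 52064, 52064, 52066, 52066],
    "rent_per_sqm": [30.0, 34.0, 20.0, 24.0, 25.0, 27.0],
})

# Same encoding as in the notebook: the first category (52062) is dropped
dummies = pd.get_dummies(toy["plz"].astype("category"), prefix="plz",
                         drop_first=True, dtype=float)

# Design matrix with an explicit intercept column
X = np.column_stack([np.ones(len(toy)), dummies.to_numpy()])
y = toy["rent_per_sqm"].to_numpy()

# Ordinary least squares via NumPy
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
coef = pd.Series(beta, index=["const"] + dummies.columns.tolist())
print(coef)

# Group means for comparison: const equals the baseline mean,
# each dummy coefficient equals (group mean - baseline mean)
print(toy.groupby("plz")["rent_per_sqm"].mean())
```

Read this way, a negative `plz_*` coefficient in the summary means a lower rent per square meter than in the baseline postcode, holding the other regressors fixed.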


<h3>Analysis of listing-duration</h3>

In the following code, we regress the logarithm of the dependent variable 'online_seit' on the independent variables bewohner, groesse, miete, sentiment, verbindung and plz.

In [11]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import re

# Define the necessary columns for the DataFrame
necessary_columns = ['bewohner', 'groesse', 'miete', 'plz','online_seit','sentiment','verbindung']

# Load the CSV file into a DataFrame
file_path = '/workspaces/fdap-2024-Big-Eric-Blip/casestudy/student_housing/data_analysis/anzeigen.csv'  # Path to the CSV file
df = pd.read_csv(file_path)

# Create a DataFrame with only the necessary columns
df = df[necessary_columns]

# Convert numeric columns to numeric types and handle errors
df['bewohner'] = pd.to_numeric(df['bewohner'], errors='coerce')
df['groesse'] = pd.to_numeric(df['groesse'], errors='coerce')
df['miete'] = pd.to_numeric(df['miete'], errors='coerce')
df['sentiment'] = pd.to_numeric(df['sentiment'], errors='coerce')*100  # Rescale the sentiment score by a factor of 100
df['online_seit'] = pd.to_numeric(df['online_seit'], errors='coerce')
df['verbindung'] = pd.to_numeric(df['verbindung'], errors='coerce')

# Drop rows with missing values in numeric columns
df = df.dropna(subset=['bewohner', 'groesse', 'miete'])

# Convert 'plz' to categorical and create dummy variables
df['plz'] = df['plz'].astype('category')
df['verbindung'] = df['verbindung'].astype('category')
df = pd.get_dummies(df, columns=['plz'], drop_first=True)
df = pd.get_dummies(df, columns=['verbindung'], drop_first=True)

# Add the intercept (constant term) for the regression model
df = sm.add_constant(df)

# Round numeric values to a specified number of decimal places
df = df.round({'bewohner': 0, 'groesse': 0, 'miete': 2,'sentiment': 5})

# Prepare the data for regression
X = df.drop(columns=['online_seit'])  # Exclude dependent variable 'online_seit'
y = np.log(df['online_seit'])

# Ensure all columns used for regression are numeric
X = X.apply(pd.to_numeric, errors='coerce')  # Convert to numeric, forcing non-numeric to NaN
X = X.fillna(0)  # Fill NaNs with 0 or some other value depending on context

# Display only the column names
display(X.columns.tolist())

# Convert to float NumPy arrays
X = np.asarray(X, dtype=float)
y = np.asarray(y, dtype=float)

# Fit the model
model = sm.OLS(y, X).fit()

# Print the summary of the regression
print("Regression Summary:")
print(model.summary())  # Coefficients table



['const',
 'bewohner',
 'groesse',
 'miete',
 'sentiment',
 'plz_52064',
 'plz_52066',
 'plz_52068',
 'plz_52070',
 'plz_52072',
 'plz_52074',
 'plz_52078',
 'plz_52080',
 'plz_52159',
 'plz_52249',
 'verbindung_True']

Regression Summary:
                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.251
Model:                            OLS   Adj. R-squared:                  0.202
Method:                 Least Squares   F-statistic:                     5.077
Date:                Mon, 08 Jul 2024   Prob (F-statistic):           1.24e-08
Time:                        13:42:54   Log-Likelihood:                -506.31
No. Observations:                 243   AIC:                             1045.
Df Residuals:                     227   BIC:                             1101.
Df Model:                          15                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.6282      0.851
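
Since the dependent variable is `log(online_seit)`, a coefficient is not a change in days but approximately a relative change. A minimal sketch of the usual conversion follows; the helper name `pct_effect` and the coefficient values are hypothetical and are not taken from the regression output.

```python
import numpy as np

def pct_effect(beta: float) -> float:
    """Percentage change in y for a one-unit increase in x
    when log(y) is regressed on x."""
    return 100.0 * (np.exp(beta) - 1.0)

# Hypothetical coefficients, chosen only for illustration
for beta in (0.05, -0.20, 0.70):
    print(f"beta = {beta:+.2f} -> {pct_effect(beta):+.1f} %")
```

For small coefficients the result is close to `100 * beta`; for larger ones the exact formula matters. Note also that `np.log` is only finite for strictly positive values, so this specification assumes every listing has `online_seit > 0`.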