**Important notes**

- Activate the appropriate Conda environment before you begin.

- Enter your HdM ID (Kürzel) as `NAME`.

- Do **not** change the file name and do not delete any cells.

- Complete all cells marked with <font color='green'> \# YOUR CODE HERE </font>

- The **NotImplementedError()** call is there to prevent empty cells from being submitted. Delete it as soon as you start working in one of these cells.

- Make sure everything runs as expected before you submit the exam: restart the kernel and run all cells by selecting "Restart" and then "Run All".

In [1]:
NAME = "JOHN"

In [2]:
import IPython
assert IPython.version_info[0] >= 3, "Your version of IPython is too old, please update it."

---

# Netflix user engagement analysis

## Setup

In [3]:
# Import the pandas library
import pandas as pd

---
## Data

### Import data (2 points)

- We imported pandas under the alias `pd`.

- The pandas function for reading CSV files: `.read_csv`

- Import the CSV data with pandas and store it under the name `df`.

- Path to the data on GitHub: 'https://raw.githubusercontent.com/kirenz/datasets/master/netflix.csv'

- Hint:

```python
___ = ___.___(___)
```

In [4]:
# YOUR CODE HERE
URL = 'https://raw.githubusercontent.com/kirenz/datasets/master/netflix.csv'
df = pd.read_csv(URL)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   group    2000 non-null   object
 1   outcome  2000 non-null   object
dtypes: object(2)
memory usage: 31.4+ KB


In [6]:
"""Check that df returns the correct output"""
assert len(df) == 2000
assert df.columns.tolist() == ['group', 'outcome']

---
## Analysis

### Cross table (4 points)

Write the code that creates a cross table (contingency table):

- Store the table under the name: `cross_table`
- The pandas function for creating cross tables: `.crosstab(ROWS, COLUMNS)`
- Use this variable as rows: `group`
- Use this variable as columns: `outcome`
- Include the row and column totals (`margins=True`)
- Output the values as relative frequencies (`normalize=True`)
- The variable `group` should be labeled "Gruppe" in the table (`rownames`)
- The variable `outcome` should be labeled "Ergebnis" in the table (`colnames`)
- Format the values as percentages by multiplication: e.g., a value of 0.1775 should be displayed as 17.75
- Display the table

- Hint:

```python

___ = __.__(df['___'], df['___'],
            ___=True, # Row and column totals
            ___=True, # Relative frequencies
            rownames=['___'],  # Rename the rows
            colnames=['___'] # Rename the columns
            ) * ___ # Multiply by 100 to display the values as percentages

___

```

In [7]:
cross_table = pd.crosstab(df['group'], df['outcome'], rownames=['Gruppe'], colnames=['Ergebnis'], margins=True, normalize=True) * 100

In [12]:
df.head()

|   | group | outcome       |
|---|-------|---------------|
| 0 | A     | Engagement    |
| 1 | B     | No engagement |
| 2 | B     | No engagement |
| 3 | B     | No engagement |
| 4 | B     | No engagement |


In [17]:
observed = df.reset_index().groupby(['group','outcome']).count().pivot_table(index='outcome', columns='group', values='index', fill_value=0)
observed

| outcome (rows) / group (columns) | A     | B     |
|----------------------------------|-------|-------|
| Engagement                       | 645.0 | 436.0 |
| No engagement                    | 355.0 | 564.0 |


In [29]:
"""Check that it returns the correct output"""
assert float(cross_table['Engagement'].B) == 21.8

In [30]:
cross_table.values

array([[32.25, 17.75],
       [21.8 , 28.2 ]])
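
With `normalize=True`, every cell is a share of all 2000 observations. To compare the two groups directly, it can help to normalize within each group instead. The following is a minimal sketch, assuming a stand-in DataFrame (`df_demo`, a hypothetical name) rebuilt from the cell counts of the `observed` table in this notebook rather than from the downloaded file:

```python
import pandas as pd

# Cell counts taken from the `observed` table in this notebook
counts = {
    ("A", "Engagement"): 645,
    ("A", "No engagement"): 355,
    ("B", "Engagement"): 436,
    ("B", "No engagement"): 564,
}

# Rebuild one row per user so the example runs without downloading the CSV
rows = [(group, outcome) for (group, outcome), n in counts.items() for _ in range(n)]
df_demo = pd.DataFrame(rows, columns=["group", "outcome"])

# normalize="index" divides each row by its row total (share within each group)
rates = pd.crosstab(df_demo["group"], df_demo["outcome"], normalize="index") * 100
print(rates.round(2))
```

Read row by row, this gives the engagement rate per group: 645 of 1000 users in group A and 436 of 1000 users in group B.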

# Variable Classification

Numeric 
- discrete (countable values)
- continuous (any value within an interval)

Categorical 
- Nominal (=, !=)
- Ordinal (=, !=, <, >, <=, >=)
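
The difference between nominal and ordinal variables can be made concrete with pandas. This is only a sketch with made-up values (the category labels are hypothetical and not part of the Netflix data); it uses `pd.Categorical` with and without `ordered=True`:

```python
import pandas as pd

# Hypothetical values for illustration only
nominal = pd.Categorical(["A", "B", "A"], categories=["A", "B"], ordered=False)
ordinal = pd.Categorical(
    ["low", "high", "medium"],
    categories=["low", "medium", "high"],
    ordered=True,
)

# Nominal: only equality comparisons are meaningful
same_group = nominal == "A"

# Ordinal: order comparisons are defined as well
below_high = ordinal < "high"
highest = ordinal.max()

print(same_group, below_high, highest)
```

Trying an order comparison such as `nominal < "B"` on the unordered categorical raises a `TypeError`, which mirrors the rule that nominal data only supports `=` and `!=`.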

Which visualization to use for which data type: https://www.data-to-viz.com/

Altair data visualization: https://altair-viz.github.io/gallery/one_dot_per_zipcode.html

# Chi-Square (χ²) Test in A/B Testing

The Chi-Square (χ²) test is used in A/B testing to determine if there’s a statistically significant difference between two groups, especially when dealing with categorical data. It checks if the distribution of observed frequencies in each group differs from what would be expected under a null hypothesis.

## When to Use the Chi-Square Test in A/B Testing
If your A/B test involves categorical outcomes (e.g., "clicked" vs. "not clicked" on an ad), the Chi-Square test is useful. It’s often used to test whether proportions (like conversion rates) differ between two groups.

## Steps to Conduct a Chi-Square Test in A/B Testing

### 1. Set Up Hypotheses
   - **Null Hypothesis (H₀)**: There is no difference in outcomes between the two groups (e.g., conversion rates in Group A and Group B are the same).
   - **Alternative Hypothesis (H₁)**: There is a difference in outcomes between the two groups (e.g., conversion rates differ).
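
In symbols, the two hypotheses can be written compactly. The notation $p_A$ and $p_B$ for the unknown conversion probabilities in Group A and Group B is an assumption introduced here, not taken from the text above:

$
H_0: p_A = p_B \qquad \text{vs.} \qquad H_1: p_A \neq p_B
$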

### 2. Collect Data and Organize It into a Contingency Table
   Organize your data in a 2x2 table for two groups and two outcomes:

   | Outcome        | Group A | Group B |
   |----------------|---------|---------|
   | Converted      | a       | b       |
   | Not Converted  | c       | d       |

   Where:
   - **a** = number of conversions in Group A
   - **b** = number of conversions in Group B
   - **c** = number of non-conversions in Group A
   - **d** = number of non-conversions in Group B
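
A table like this can be sketched as a small DataFrame. The example below assumes the four counts are already known and plugs in this notebook's `observed` values, treating "Engagement" as "Converted"; the variable names are illustrative:

```python
import pandas as pd

# Counts a, b, c, d; the numbers are this notebook's `observed` values
a, b, c, d = 645, 436, 355, 564

table = pd.DataFrame(
    {"Group A": [a, c], "Group B": [b, d]},
    index=["Converted", "Not Converted"],
)

row_totals = table.sum(axis=1)      # one total per outcome
column_totals = table.sum(axis=0)   # one total per group
grand_total = int(table.to_numpy().sum())

print(table)
print(row_totals, column_totals, grand_total, sep="\n")
```

The row, column, and grand totals are the only inputs needed for the expected values in the next step.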

### 3. Calculate the Expected Values
   For each cell in the table, calculate the expected value under the assumption that the null hypothesis is true:

  $
   \text{Expected Value for a Cell} = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}
  $
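
Applied to every cell at once, this formula is an outer product of the row and column totals divided by the grand total. A minimal sketch with NumPy, assuming the counts from this notebook's `observed` table (the name `observed_counts` is illustrative):

```python
import numpy as np

# Rows: Engagement / No engagement, columns: group A / group B
observed_counts = np.array([[645, 436],
                            [355, 564]], dtype=float)

row_totals = observed_counts.sum(axis=1)
column_totals = observed_counts.sum(axis=0)
grand_total = observed_counts.sum()

# Expected count per cell = row total * column total / grand total
expected_counts = np.outer(row_totals, column_totals) / grand_total
print(expected_counts)
```

The result agrees with the "Expected Frequencies" that `chi2_contingency` prints later in this notebook.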

### 4. Calculate the Chi-Square Statistic
   Use the Chi-Square formula:

 $
   \chi^2 = \sum \frac{(O - E)^2}{E}
 $

   Where:
   - $ O $ = observed frequency in each cell
   - $ E $ = expected frequency in each cell

   Sum this value across all cells in the table.
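
The sum can be computed directly from the observed and expected counts. The sketch below reuses this notebook's numbers and hypothetical variable names; it implements the plain formula above, without any continuity correction:

```python
import numpy as np

observed_counts = np.array([[645.0, 436.0],
                            [355.0, 564.0]])
expected_counts = np.array([[540.5, 540.5],
                            [459.5, 459.5]])

# Sum of (O - E)^2 / E over all four cells
chi2_stat = ((observed_counts - expected_counts) ** 2 / expected_counts).sum()
print(round(chi2_stat, 2))
```

Because this is the uncorrected Pearson statistic, it comes out slightly larger than the value SciPy's `chi2_contingency` reports for a 2x2 table with its default settings.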

### 5. Determine the Degrees of Freedom (DoF)
   For a 2x2 table, the number of degrees of freedom ($\text{DoF}$) is 1, calculated as:

   $
   \text{DoF} = (\text{Rows} - 1) \times (\text{Columns} - 1)
   $
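
The same rule applies to larger tables. A small sketch, with `degrees_of_freedom` as a hypothetical helper name:

```python
import numpy as np

def degrees_of_freedom(table):
    """Return (rows - 1) * (columns - 1) for a two-dimensional table."""
    n_rows, n_cols = np.asarray(table).shape
    return (n_rows - 1) * (n_cols - 1)

dof_2x2 = degrees_of_freedom([[645, 436], [355, 564]])
dof_3x4 = degrees_of_freedom(np.zeros((3, 4)))
print(dof_2x2, dof_3x4)
```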

### 6. Find the p-value
   Using the calculated χ² value and DoF, look up the p-value in a Chi-Square distribution table, or use statistical software to determine it.
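
Instead of a printed table, the critical value for a given significance level can be obtained from SciPy's chi-square distribution. This sketch assumes `alpha = 0.05` and one degree of freedom, matching the 2x2 case:

```python
from scipy.stats import chi2

alpha = 0.05
dof = 1

# Critical value: the statistic above which the p-value drops below alpha
critical_value = chi2.ppf(1 - alpha, dof)

# Sanity check: the survival function at the critical value returns alpha
p_at_critical = chi2.sf(critical_value, dof)
print(critical_value, p_at_critical)
```

Any χ² statistic larger than this critical value corresponds to a p-value below `alpha`.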

### 7. Interpret the Results
   - If the p-value is below your significance level (e.g., 0.05), reject the null hypothesis, suggesting a significant difference between Group A and Group B.
   - If the p-value is at or above the significance level, fail to reject the null hypothesis: the data do not provide sufficient evidence of a difference, which is not the same as showing that no difference exists.

In [19]:
observed

| outcome (rows) / group (columns) | A     | B     |
|----------------------------------|-------|-------|
| Engagement                       | 645.0 | 436.0 |
| No engagement                    | 355.0 | 564.0 |


In [18]:
from scipy.stats import chi2_contingency

# `observed` is the 2x2 contingency table built earlier in this notebook.
# Perform the Chi-Square test (for 2x2 tables, SciPy applies
# Yates' continuity correction by default).
chi2, p, dof, expected = chi2_contingency(observed)

# Results
print("Chi-Square Statistic:", chi2)
print("p-value:", p)
print("Degrees of Freedom:", dof)
print("Expected Frequencies:\n", expected)

# Interpret the result
alpha = 0.05  # significance level
if p < alpha:
    print("Reject the null hypothesis: significant difference between Group A and Group B")
else:
    print("Fail to reject the null hypothesis: no significant difference between Group A and Group B")

Chi-Square Statistic: 87.09945955413467
p-value: 1.031987472111091e-20
Degrees of Freedom: 1
Expected Frequencies:
 [[540.5 540.5]
 [459.5 459.5]]
Reject the null hypothesis: significant difference between Group A and Group B
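
The statistic of about 87.10 printed above includes Yates' continuity correction, which `chi2_contingency` applies by default when the table has one degree of freedom. The sketch below compares both variants using the `correction` parameter; the array `observed_counts` is a stand-in that repeats this notebook's counts so the example runs on its own:

```python
import numpy as np
from scipy.stats import chi2_contingency

observed_counts = np.array([[645, 436],
                            [355, 564]])

# Default: Yates' continuity correction is applied for 2x2 tables
stat_corrected, p_corrected, _, _ = chi2_contingency(observed_counts)

# correction=False gives the plain Pearson statistic sum((O - E)^2 / E)
stat_plain, p_plain, _, _ = chi2_contingency(observed_counts, correction=False)

print(stat_corrected, stat_plain)
```

With samples of this size the correction barely matters: both p-values are far below any common significance level.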


In [22]:
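from scipy.stats import chi2  # note: rebinds the name `chi2`, which held the test statistic above

chi2_stat = 87.93  # test statistic entered by hand
dof = 1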

# Calculate the p-value
p_value = chi2.sf(chi2_stat, dof)
print("p-value:", p_value)

p-value: 6.78123295713596e-21


In [None]:
# The probability of a Type I error (false positive) equals the significance level alpha
type_I_error = alpha
print("Type I Error (False Positive):", type_I_error)

