# Megatutorial 1: Exploratory Data Analysis

This megatutorial covers exploratory data analysis in Python.

## Tasks

* Load the data into `pandas`.
* Perform a descriptive data analysis using suitable statistical measures.
* Perform a descriptive data analysis using suitable visualizations.

## Loading the data into `pandas`

In [1]:
from pandas import read_csv
from matplotlib import pyplot as plt
import seaborn as sns

In [2]:
data = read_csv(
    "../../data/fake_profiles.csv",
    sep=",",               # column separator
    decimal=".",           # decimal mark
    usecols=range(1, 13),  # read columns 1 to 12 and skip column 0
)

## Simple statistical summaries

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 576 entries, 0 to 575
Data columns (total 12 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   profile_pic                    576 non-null    object 
 1   rel_num_numeric_char_username  576 non-null    float64
 2   words_fullname                 576 non-null    int64  
 3   rel_num_numeric_char_fullname  576 non-null    float64
 4   name=username                  576 non-null    object 
 5   description_length             576 non-null    int64  
 6   has_external_url               576 non-null    object 
 7   is_private                     576 non-null    object 
 8   num_posts                      576 non-null    int64  
 9   num_followers                  556 non-null    float64
 10  num_follows                    576 non-null    int64  
 11  is_fake                        576 non-null    object 
dtypes: float64(3), int64(4), object(5)
memory usage: 5
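
The `info()` output lists 556 non-null values for `num_followers` against 576 rows, so 20 entries are missing in that column. A minimal sketch of how one might count missing values per column follows; the small frame `demo` is invented stand-in data that only borrows two column names from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; the values are invented for illustration
# and are not taken from fake_profiles.csv.
demo = pd.DataFrame(
    {
        "num_posts": [0, 9, 81, 7],
        "num_followers": [38.0, np.nan, 728.0, np.nan],
    }
)

# Count missing entries per column.
missing_per_column = demo.isna().sum()
print(missing_per_column)

# Share of rows that contain at least one missing value.
share_incomplete = demo.isna().any(axis=1).mean()
print(share_incomplete)
```

Applied to the tutorial's `data` frame, the same `isna().sum()` call should report the gap in `num_followers` directly.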

In [4]:
data.describe()

| | rel_num_numeric_char_username | words_fullname | rel_num_numeric_char_fullname | description_length | num_posts | num_followers | num_follows |
|---|---|---|---|---|---|---|---|
| count | 576.0 | 576.0 | 576.0 | 576.0 | 576.0 | 556.0 | 576.0 |
| mean | 0.163837 | 1.460069 | 0.036094 | 22.623264 | 107.489583 | 88359.09 | 508.381944 |
| std | 0.214096 | 1.052601 | 0.125121 | 37.702987 | 402.034431 | 926257.3 | 917.981239 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 38.0 | 57.5 |
| 50% | 0.0 | 1.0 | 0.0 | 0.0 | 9.0 | 141.5 | 229.5 |
| 75% | 0.31 | 2.0 | 0.0 | 34.0 | 81.5 | 728.5 | 589.5 |
| max | 0.92 | 12.0 | 1.0 | 150.0 | 7389.0 | 15338540.0 | 7500.0 |
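
In the summary above, the mean of `num_followers` (about 88,359) lies far above its median (141.5), which points to a strongly right-skewed distribution. In such cases the median and the interquartile range are often more informative than mean and standard deviation. The sketch below illustrates this on invented numbers; the series `followers` is an assumption for demonstration, not data from the file.

```python
import pandas as pd

# Invented follower counts with one extreme value, loosely imitating the
# skew seen in num_followers; these are not values from the dataset.
followers = pd.Series([10, 40, 140, 150, 700, 1_000_000], name="num_followers")

mean = followers.mean()
median = followers.median()

# Interquartile range: spread of the middle 50 % of the values.
q1, q3 = followers.quantile([0.25, 0.75])
iqr = q3 - q1

print(f"mean={mean:.1f}, median={median:.1f}, IQR={iqr:.1f}")
```

A single extreme value pulls the mean upward by orders of magnitude, while the median and the IQR barely react to it.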


In [5]:
data["words_fullname"].describe()

count    576.000000
mean       1.460069
std        1.052601
min        0.000000
25%        1.000000
50%        1.000000
75%        2.000000
max       12.000000
Name: words_fullname, dtype: float64

In [6]:
data["words_fullname"].mean()

np.float64(1.4600694444444444)

In [7]:
data["words_fullname"].std()

np.float64(1.0526005867227521)

In [8]:
data["profile_pic"].mode()

0    yes
Name: profile_pic, dtype: object

In [9]:
data["profile_pic"].value_counts()

profile_pic
yes    404
no     172
Name: count, dtype: int64
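
Absolute counts are often easier to compare as proportions. As a small sketch, the series below is rebuilt from the counts shown above (404 `yes`, 172 `no`) rather than read from the file, and `value_counts(normalize=True)` returns relative frequencies.

```python
import pandas as pd

# Rebuild a series from the counts reported above (404 "yes", 172 "no").
profile_pic = pd.Series(["yes"] * 404 + ["no"] * 172, name="profile_pic")

# normalize=True returns proportions instead of absolute counts.
shares = profile_pic.value_counts(normalize=True)
print(shares.round(3))
```

On the tutorial's frame, the equivalent call would be `data["profile_pic"].value_counts(normalize=True)`.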

In [10]:
data.select_dtypes(include="number").corr()

| | rel_num_numeric_char_username | words_fullname | rel_num_numeric_char_fullname | description_length | num_posts | num_followers | num_follows |
|---|---|---|---|---|---|---|---|
| rel_num_numeric_char_username | 1.0 | -0.225472 | 0.408567 | -0.32117 | -0.157442 | -0.063386 | -0.172413 |
| words_fullname | -0.225472 | 1.0 | -0.094348 | 0.272522 | 0.07335 | 0.033532 | 0.094855 |
| rel_num_numeric_char_fullname | 0.408567 | -0.094348 | 1.0 | -0.117521 | -0.057716 | -0.027388 | -0.067971 |
| description_length | -0.32117 | 0.272522 | -0.117521 | 1.0 | 0.144824 | 0.006369 | 0.226561 |
| num_posts | -0.157442 | 0.07335 | -0.057716 | 0.144824 | 1.0 | 0.325895 | 0.098225 |
| num_followers | -0.063386 | 0.033532 | -0.027388 | 0.006369 | 0.325895 | 1.0 | -0.011009 |
| num_follows | -0.172413 | 0.094855 | -0.067971 | 0.226561 | 0.098225 | -0.011009 | 1.0 |
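
By default, `corr()` computes the Pearson coefficient, which measures linear association and reacts strongly to extreme values. For heavy-tailed columns, a rank-based coefficient such as Spearman's may be worth comparing. The following sketch uses invented data (`demo` is a hypothetical frame) to show how the two can differ for a monotonic but non-linear relationship.

```python
import pandas as pd

# Invented data: y grows monotonically but non-linearly with x.
demo = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
demo["y"] = demo["x"] ** 3

# Pearson measures linear association, Spearman works on ranks.
pearson = demo.corr(method="pearson").loc["x", "y"]
spearman = demo.corr(method="spearman").loc["x", "y"]
print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

Because the ranks of `x` and `y` agree perfectly here, Spearman's coefficient is 1, while Pearson's stays below 1.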


In [11]:
data.select_dtypes(exclude="number")

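| | profile_pic | name=username | has_external_url | is_private | is_fake |
|---|---|---|---|---|---|
| 0 | yes | no | no | no | no |
| 1 | yes | no | no | no | no |
| 2 | yes | no | no | yes | no |
| 3 | yes | no | no | no | no |
| 4 | yes | no | no | yes | no |
| ... | ... | ... | ... | ... | ... |
| 571 | yes | no | no | no | yes |
| 572 | yes | no | no | no | yes |
| 573 | yes | no | no | no | yes |
| 574 | yes | no | no | no | yes |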
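
The non-numeric columns shown here hold `yes`/`no` strings, which `describe()` and `corr()` skip. One possible approach, sketched below on an invented miniature frame (`flags` is hypothetical, only the column names come from the dataset), is to map them to booleans so that they can be summarized numerically.

```python
import pandas as pd

# Hypothetical miniature of the yes/no columns; the values are invented.
flags = pd.DataFrame(
    {
        "profile_pic": ["yes", "yes", "no"],
        "is_private": ["no", "yes", "no"],
        "is_fake": ["no", "no", "yes"],
    }
)

# Map the strings to booleans column by column. Labels other than
# "yes"/"no" would become NaN and thereby show up as missing values.
as_bool = flags.apply(lambda col: col.map({"yes": True, "no": False}))

# The mean of a boolean column is the share of True values.
print(as_bool.mean())
```

Whether every object column in the real data contains only `yes` and `no` is an assumption that is worth checking with `value_counts()` first.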
