### Exploratory Data Analysis (EDA) (Intro)

Exploratory data analysis (EDA) is the process of analyzing data sets to summarize their main characteristics using a combination of statistical calculations and data visualizations.

**Univariate** means "*one variable*", or one type of data. Univariate analysis describes each variable on its own; it does not deal with causes or relationships between two or more variables.

A (statistical) variable, in the data science and statistics context, is a series of data (two to many values) representing measured values of a single concept across multiple cases.

There are three types of univariate statistics that we are interested in for each variable:

1. General information (data type, count of total values, number of unique values)
2. Range and middle (min, max, mean, median, mode, quartiles)
3. Normality and spread (standard deviation, skewness, kurtosis)

### Univariate Statistics

#### General Information

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
import pandas as pd
df = pd.read_csv('/content/drive/My Drive/Datasets/insurance.csv')

In [3]:
df.head()

   age     sex     bmi  children smoker     region      charges
0   19  female    27.9         0    yes  southwest    16884.924
1   18    male   33.77         1     no  southeast    1725.5523
2   28    male    33.0         3     no  southeast     4449.462
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male   28.88         0     no  northwest    3866.8552


In [4]:
df.shape

(1338, 7)

In [5]:
df.describe()

             age        bmi  children       charges
count     1338.0     1338.0    1338.0        1338.0
mean   39.207025  30.663397  1.094918  13270.422265
std     14.04996   6.098187  1.205493  12110.011237
min         18.0      15.96       0.0     1121.8739
25%         27.0   26.29625       0.0    4740.28715
50%         39.0       30.4       1.0      9382.033
75%         51.0   34.69375       2.0  16639.912515
max         64.0      53.13       5.0   63770.42801


In [6]:
df.columns

Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'], dtype='object')

In [7]:
# Print the count of non-empty values for each feature
print('age: ' + str(df.age.count()))
print('sex: ' + str(df.sex.count()))
print('bmi: ' + str(df.bmi.count()))
print('children: ' + str(df.children.count()))
print('smoker: ' + str(df.smoker.count()))
print('region: ' + str(df.region.count()))
print('charges: ' + str(df.charges.count()))

print('\n')

# Print the number of unique values for each feature
print('age: ' + str(df.age.nunique()))
print('sex: ' + str(df.sex.nunique()))
print('bmi: ' + str(df.bmi.nunique()))
print('children: ' + str(df.children.nunique()))
print('smoker: ' + str(df.smoker.nunique()))
print('region: ' + str(df.region.nunique()))
print('charges: ' + str(df.charges.nunique()))

age: 1338
sex: 1338
bmi: 1338
children: 1338
smoker: 1338
region: 1338
charges: 1338


age: 47
sex: 2
bmi: 548
children: 6
smoker: 2
region: 4
charges: 1337


In [8]:
# Print the number of unique values and the data type of every column at once
print(df.nunique())
print()
print(df.dtypes)

age           47
sex            2
bmi          548
children       6
smoker         2
region         4
charges     1337
dtype: int64

age           int64
sex          object
bmi         float64
children      int64
smoker       object
region       object
charges     float64
dtype: object


In [9]:
print('age: ' + str(df.age.dtype))
print('sex: ' + str(df.sex.dtype))
print('bmi: ' + str(df.bmi.dtype))
print('children: ' + str(df.children.dtype))
print('smoker: ' + str(df.smoker.dtype))
print('region: ' + str(df.region.dtype))
print('charges: ' + str(df.charges.dtype))

age: int64
sex: object
bmi: float64
children: int64
smoker: object
region: object
charges: float64


In [10]:
print('age: ' + str(pd.api.types.is_numeric_dtype(df.age)))
print('sex: ' + str(pd.api.types.is_numeric_dtype(df.sex)))
print('bmi: ' + str(pd.api.types.is_numeric_dtype(df.bmi)))
print('children: ' + str(pd.api.types.is_numeric_dtype(df.children)))
print('smoker: ' + str(pd.api.types.is_numeric_dtype(df.smoker)))
print('region: ' + str(pd.api.types.is_numeric_dtype(df.region)))
print('charges: ' + str(pd.api.types.is_numeric_dtype(df.charges)))

age: True
sex: False
bmi: True
children: True
smoker: False
region: False
charges: True


In [11]:
print('age: ' + str(df.age.isna().sum()))
print('sex: ' + str(df.sex.isna().sum()))
print('bmi: ' + str(df.bmi.isna().sum()))
print('children: ' + str(df.children.isna().sum()))
print('smoker: ' + str(df.smoker.isna().sum()))
print('region: ' + str(df.region.isna().sum()))
print('charges: ' + str(df.charges.isna().sum()))

age: 0
sex: 0
bmi: 0
children: 0
smoker: 0
region: 0
charges: 0
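
The per-column `print` calls in the cells above can also be collected into a single table. The following is a minimal sketch rather than the notebook's own method: the `summarize` helper and the three-row toy DataFrame are made up for illustration, and it relies only on the same pandas calls already used above (`count`, `nunique`, `dtypes`, `isna`, `pd.api.types.is_numeric_dtype`).

```python
import pandas as pd


def summarize(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row of general information per column (hypothetical helper)."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "count": frame.count(),         # number of non-missing values
        "unique": frame.nunique(),      # number of distinct non-missing values
        "missing": frame.isna().sum(),  # number of missing values
        "numeric": pd.Series(
            {col: pd.api.types.is_numeric_dtype(frame[col]) for col in frame.columns}
        ),
    })


# Toy data standing in for insurance.csv (values are illustrative only)
toy = pd.DataFrame({
    "age": [19, 18, 28],
    "sex": ["female", "male", "male"],
    "bmi": [27.9, 33.77, None],
})

summary = summarize(toy)
print(summary)
```

Calling `summarize(df)` on the insurance DataFrame should give the same counts, unique values, and data types as the individual cells, with one row per feature.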


In [12]:
print(df.charges.min())
print(df.charges.quantile(.25))
print(df.charges.quantile(.50))
print(df.charges.quantile(.75))
print(df.charges.max())

print(df.charges.mean())
print(df.charges.median())
print(df.charges.mode().values[0])

1121.8739
4740.28715
9382.033
16639.912515
63770.42801
13270.422265141257
9382.033
1639.5631


In [13]:
df.charges.std()

12110.011236694001

In [14]:
import numpy as np

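# np.std defaults to the population standard deviation (ddof=0),
# so the result is slightly smaller than the pandas sample std above
np.std(df.charges)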

12105.484975561612
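
The two results differ because the libraries use different defaults: `Series.std()` in pandas divides by n - 1 (the sample standard deviation, `ddof=1`), while `np.std()` divides by n (the population standard deviation, `ddof=0`). As a small sketch, the standard library's `statistics` module exposes both versions under separate names; the list below is made-up data used only for illustration.

```python
import math
import statistics as stat

values = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative data, not from insurance.csv
n = len(values)

sample_std = stat.stdev(values)       # divides by n - 1, like pandas Series.std()
population_std = stat.pstdev(values)  # divides by n, like np.std() with ddof=0

print(sample_std)
print(population_std)

# The two estimates differ only by a factor of sqrt((n - 1) / n)
print(math.isclose(sample_std * math.sqrt((n - 1) / n), population_std))
```

With n = 1338 rows, the same factor accounts for the gap between 12110.01 and 12105.48 for `charges`.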

#### Boundaries and Middle

Boundaries are the minimum and maximum values. The middle is described by the quartiles, mean, median, and mode.

In [15]:
myList = [1,2,3,4,5,6,7,8,9,10]
print(max(myList))
print(min(myList))

10
1


In [16]:
# Using a Python list
import numpy as np

myList = [1,2,3,4,5,6,7,8,9,10]
print(np.quantile(myList, .25))
print(np.quantile(myList, .50))
print(np.quantile(myList, .75))

3.25
5.5
7.75
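
The quartiles are not members of the list because `np.quantile` interpolates linearly between neighbouring values by default. The sketch below re-implements that default rule in plain Python to show where 3.25 comes from; `linear_quantile` is a hypothetical helper name, not a NumPy function.

```python
import math


def linear_quantile(values, q):
    """Quantile by linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q  # fractional zero-based index
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# For q = 0.25 the position is 9 * 0.25 = 2.25, so the result lies a quarter
# of the way from ordered[2] = 3 to ordered[3] = 4.
print(linear_quantile(my_list, 0.25))
print(linear_quantile(my_list, 0.50))
print(linear_quantile(my_list, 0.75))
```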


In [17]:
import statistics as stat

myList = [1,2,3,4,5,6,7,8,9,10]
stat.mean(myList)

5.5

In [18]:
import pandas as pd

# First we create a dataframe to test this on. Notice that we first add
# the data, then specify a series of column headers and row index names
# (as opposed to index numbers)
df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],
[15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],
columns=['Apple', 'Orange', 'Banana', 'Pear'],
index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
'Basket5', 'Basket6'])

# Now print it to see what it looks like
print(df)

         Apple  Orange  Banana  Pear
Basket1     10      20      30    40
Basket2      7      14      21    28
Basket3     55      15       8    12
Basket4     15      14       1     8
Basket5      7       1       1     8
Basket6      5       4       9     2


In [19]:
# Now we can use pandas built in mean() function to see the mean for each column
print(df.mean())

Apple     16.500000
Orange    11.333333
Banana    11.666667
Pear      16.333333
dtype: float64


In [20]:
# Using the same dataset as above
df.median() # median of each column; use df.median(axis="columns") for the median of each row

Apple      8.5
Orange    14.0
Banana     8.5
Pear      10.0
dtype: float64
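
The `axis` argument decides the direction along which the statistic is computed. As a small sketch on the first two baskets of the fruit data (the variable name `baskets` is chosen here for illustration): the default returns one median per column, while `axis="columns"` works across the columns and returns one median per row.

```python
import pandas as pd

baskets = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28]],
                       columns=['Apple', 'Orange', 'Banana', 'Pear'],
                       index=['Basket1', 'Basket2'])

per_column = baskets.median()             # one value per fruit
per_row = baskets.median(axis="columns")  # one value per basket

print(per_column)
print(per_row)
```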

In [21]:
# Using the same dataset as above
df.mode() # most frequent value(s) in each column; use df.mode(axis="columns") for the mode of each row

   Apple  Orange  Banana  Pear
0      7      14       1     8
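
Each fruit column happens to have a single most frequent value, so the result has one row. When several values tie for the highest count, there is more than one mode. The sketch below uses the standard library's `statistics.multimode` (Python 3.8 or later) to show this; the `tied` list is made up for illustration.

```python
import statistics as stat

tied = [1, 1, 2, 2, 3]            # 1 and 2 both appear twice
tied_modes = stat.multimode(tied)  # every value that shares the highest count
print(tied_modes)

apple = [10, 7, 55, 15, 7, 5]      # the Apple column from the fruit data
apple_modes = stat.multimode(apple)
print(apple_modes)
```

pandas behaves similarly: `Series.mode()` returns every tied value, which is why `df.mode()` returns a DataFrame rather than a single number per column.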


#### Standard Deviation

In [22]:
# Using Python list
import numpy as np
myList = [1,2,3,4,5,6,7,8,9,10]
print(np.std(myList, ddof=1)) # ddof=1 switches np.std from its default population formula to the sample std (s)

# Using Pandas DataFrame column
import pandas as pd
df = pd.DataFrame(data=[1,2,3,4,5,6,7,8,9,10], columns=['numbers'])
print(df.numbers.std())       # Assumes a sample std (s) by default

3.0276503540974917
3.0276503540974917


In [23]:
import pandas as pd
df = pd.read_csv('/content/drive/My Drive/Datasets/insurance.csv')

print(df.charges.std())
print(df['charges'].std())
print(df.std(numeric_only=True))  # limit to numeric columns; sex, smoker, and region hold strings

12110.011236694001
12110.011236694001
age            14.049960
bmi             6.098187
children        1.205493
charges     12110.011237
dtype: float64



#### Normality: Skewness and Kurtosis

In [24]:
# Using Python list
from scipy.stats import kurtosis, skew
myList = [1,2,2,3,3,3,4,4,4,4,4,5,5,5,5,5,5,5,5,6,6,6,6,6,6,6,6,6,7,7,7,7,7,8,8,8,9,9,10]
print(skew(myList, bias=False))
print(kurtosis(myList, bias=False))

# Using Pandas DataFrame
import pandas as pd
df = pd.DataFrame(data=[1,2,2,3,3,3,4,4,4,4,4,5,5,5,5,5,5,5,5,6,6,6,6,6,6,6,6,6,7,7,7,7,7,8,8,8,9,9,10], columns=['numbers'])
print(df.numbers.skew())
print(df.numbers.kurt())

-0.01972922271337009
-0.03905580479600701
-0.01972922271337009
-0.0390558047960079


In [25]:
import pandas as pd
df = pd.read_csv('/content/drive/My Drive/Datasets/insurance.csv')

print(df.charges.skew())
print(df['charges'].skew())
print(df[['charges', 'bmi', 'children', 'age']].skew())
print()
print(df.charges.kurt())
print(df['charges'].kurt())
print(df[['charges', 'bmi', 'children', 'age']].kurt())

1.5158796580240388
1.5158796580240388
charges     1.515880
bmi         0.284047
children    0.938380
age         0.055673
dtype: float64

1.6062986532967907
1.6062986532967907
charges     1.606299
bmi        -0.050732
children    0.202454
age        -1.245088
dtype: float64
