# Pandas
<br>
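<code>axis=1</code> -- aggregate across the columns, giving one result per row (e.g. the sum of each row)
<br>
<code>axis=0</code> -- aggregate down the rows, giving one result per column (e.g. the sum of each column)
<br>
any <code>==</code>, <code>&lt;</code> or <code>&gt;</code> comparison involving <b>NaN</b> is <b>False</b> (only <code>!=</code> is <b>True</b>); use <code>isna()</code> to detect missing values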
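
A small sketch of the two points above, using a made-up two-column DataFrame (the column names and values are only for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

col_sums = df.sum(axis=0)   # collapses the rows: one value per column
row_sums = df.sum(axis=1)   # collapses the columns: one value per row

# NaN never compares equal, not even to itself; use isna() instead.
nan_equal = np.nan == np.nan
nan_unequal = np.nan != np.nan
mask = pd.Series([1.0, np.nan]).isna()
```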

| Method Name | Description |
| --- | --- |
| count | Returns the number of non-null entries in the Series |
| unique | Returns the unique values in the Series |
| nunique | Returns the number of unique values in the Series |
| value_counts | Returns a Series of counts of unique values |
| describe | Returns a Series of descriptive stats of values |
| values | Returns the underlying data as a NumPy array (an attribute, not a method) |
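
As a rough illustration of how the methods in the table treat missing values, here is a sketch on a made-up Series (the data is hypothetical):

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "a", np.nan])

n_non_null = s.count()      # NaN is excluded
uniques = s.unique()        # NaN is included as its own value
n_unique = s.nunique()      # NaN is excluded by default
counts = s.value_counts()   # NaN is excluded by default
```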

### Data cleaning
<code>df.drop(columns = [])</code> drop columns in the list<br>
<code>df.size</code> return the total number of entries<br>
<code>df.shape</code> return a tuple of the dimension<br>
<code>df[~(condition)]</code> return a DataFrame of the rows that do not satisfy the condition <br>
<code>pd.to_numeric(df[col], downcast = datatype)</code> convert the column's values to numbers, optionally downcasting to a smaller numeric type <br>
<code>df.dropna(subset=[col1, ..., col3])</code> drop rows that have nulls in the listed columns<br>
<code>df.fillna({'col1': ..., 'col2': ...})</code> fill nulls with the value given for each column in the dictionary<br>
<code>df.assign</code> returns a copy of the DataFrame with the new columns added; <code>df.append</code> did the same for rows but was removed in pandas 2.0 (use <code>pd.concat</code> instead)<br>
<code>df.idxmax()</code> index label of the largest value in each column when <code>axis=0</code>, column label of the largest value in each row when <code>axis=1</code>
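
A minimal sketch of a few of these calls on a made-up DataFrame (column names and fill values are hypothetical). None of them modifies `df` in place:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"name": ["x", "y", "z"], "score": [1.0, np.nan, 3.0], "age": [20.0, 30.0, np.nan]}
)

no_null_scores = df.dropna(subset=["score"])      # drops only the row with a null score
filled = df.fillna({"score": 0.0, "age": -1.0})   # one fill value per column
best_row = df["score"].idxmax()                   # index label, not the value itself
best_value = df.loc[best_row, "score"]
```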

### Data Organizing
<b>data selecting</b><br>

<code>df[a]</code> return a Series holding the column whose name matches <code>a</code><br>
<code>df[[a]]</code> return a DataFrame of the columns listed in <code>a</code><br>
<code>df.loc[a]</code> return a Series holding the row whose index label matches <code>a</code><br>
<code>df.loc[[a]]</code> return a DataFrame of the rows whose index labels match the elements of <code>a</code><br>
<code>df[[bool, ..., bool]]</code> return a DataFrame of the rows where the boolean is <code>True</code><br>
<code>df.loc[idx_list, col_list]</code> return a DataFrame containing the rows in <code>idx_list</code> and the columns in <code>col_list</code>. If either one is a single label instead of a list, that dimension is dropped and the result is a Series.<br>
<code>df.loc[bool_array, col_list]</code> return a DataFrame containing the rows where <code>bool_array</code> is <code>True</code> and the columns in <code>col_list</code><br>
<code>df.loc[1, :] = [..., ..., ...]</code> assign new values to the row with index label 1<br>
<code>df.loc[3:5, 'Name':'PID']</code> slice the selected rows and columns by label (both endpoints inclusive) <br>
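
The selection rules above can be sketched on a small made-up DataFrame (the `Name`, `Age` and `PID` values are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Ann", "Bob", "Cat"], "Age": [30, 25, 41], "PID": [7, 8, 9]},
    index=[3, 4, 5],
)

one_col = df["Name"]                       # Series
one_col_df = df[["Name"]]                  # DataFrame with one column
one_row = df.loc[3]                        # Series (row with label 3)
sliced = df.loc[3:4, "Name":"Age"]         # label slices include both ends
older = df.loc[df["Age"] > 28, ["Name"]]   # boolean mask plus column list
```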

<b>groupby</b> <br>
<code>df.groupby(col1)[col2].agg(['mean', 'median'])</code> return a DataFrame with the mean and median of <code>col2</code>, indexed by <code>col1</code> <br>
<code>df.groupby(col1).agg({'col2': 'mean', 'col3': 'median'})</code> return a DataFrame with the mean of <code>col2</code> and the median of <code>col3</code>, indexed by <code>col1</code> <br>
<code>df.groupby(col1)[col2].transform(lambda x: (x - x.mean()) / x.std())</code> <code>x</code> is the Series of <code>col2</code> values in one group of <code>col1</code>; the lambda standardizes <code>col2</code> within each group of <code>col1</code> and the result keeps the original index<br>
<code>df.groupby(col1).filter(lambda df: df['col2'].mean() >= limit)</code> keep only the groups of <code>col1</code> whose mean of <code>col2</code> is at least the limit (groups below the limit are dropped)<br>
<code>df.groupby(col1).first()</code> return a DataFrame that contains the first non-null entry of each column within each group of <code>col1</code> <br>
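
A short sketch of `agg`, `transform` and `filter` on made-up data (the `team` and `pts` columns and the cutoff of 5 are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"team": ["a", "a", "b", "b"], "pts": [1.0, 3.0, 10.0, 30.0]})

# One row per team, one column per aggregate.
summary = df.groupby("team")["pts"].agg(["mean", "median"])

# Standardize pts within each team; note the parentheses around the numerator.
z = df.groupby("team")["pts"].transform(lambda x: (x - x.mean()) / x.std())

# Keep only the teams whose mean pts is at least 5.
big = df.groupby("team").filter(lambda g: g["pts"].mean() >= 5)
```

`agg` returns one row per group, `transform` returns one value per original row, and `filter` returns a subset of the original rows.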

<b>pivot table</b> <br>
<code>df.pivot_table(index='x', columns='y', values='z', aggfunc='mean')</code> return a DataFrame with one row per value of <code>x</code>, one column per value of <code>y</code>, and the mean of <code>z</code> for each (<code>x</code>, <code>y</code>) combination<br>
<code>aggfunc</code>: count, mean, max, sum, size, ...<br>
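
A minimal sketch of the call above on invented data; combinations of `x` and `y` that never occur come out as `NaN`:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "x": ["r1", "r1", "r2", "r2"],
        "y": ["c1", "c2", "c1", "c1"],
        "z": [1.0, 2.0, 3.0, 5.0],
    }
)

# Rows are the values of x, columns are the values of y, cells are mean(z).
table = df.pivot_table(index="x", columns="y", values="z", aggfunc="mean")
```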

<b>concat</b><br>
<code>pd.concat([df1,df2])</code> by default, the two DataFrames are stacked vertically; each keeps its original index labels (pass <code>ignore_index=True</code> to renumber); a column missing from one DataFrame is filled with <code>NaN</code>, like an outer join in merge<br>
<code>pd.concat([df1,df2], axis=1)</code> the two DataFrames are combined horizontally, aligned on the index; columns with the same name are both kept, not merged<br>
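
A small sketch of both directions with two made-up DataFrames that share only column `a`:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df2 = pd.DataFrame({"a": [5, 6], "c": [7, 8]})

stacked = pd.concat([df1, df2])                # rows stacked, index labels repeat
side_by_side = pd.concat([df1, df2], axis=1)   # aligned on the index, both "a" columns kept
```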

<b>merge</b><br>
<code>df1.merge(df2, how='inner', left_on = ..., right_on = ..., suffixes=('_x','_y'))</code> keep the rows whose <code>left_on</code> value in <code>df1</code> matches a <code>right_on</code> value in <code>df2</code>, together with the corresponding columns from both DataFrames; the suffixes are appended to overlapping column names from <code>df1</code> and <code>df2</code> respectively.
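
A sketch of an inner merge on differently named key columns (the `id`, `key` and `val` columns are made up):

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "val": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "val": ["x", "y", "z"]})

# Only ids 2 and 3 appear on both sides; "val" overlaps, so it gets suffixes.
merged = left.merge(right, how="inner", left_on="id", right_on="key",
                    suffixes=("_x", "_y"))
```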

### String Method
<code>df[col].str.split().str[0]</code> split each string in <code>col</code> and get the first element of the resulting list<br>
<code>df[col].str.strip(char)</code> remove the characters in <code>char</code> from the beginning and end of each string (whitespace if <code>char</code> is omitted)<br>
<code>df[col].str.zfill(num)</code> if a string is shorter than <code>num</code>, pad it with zeros at the beginning<br>
<code>df[col].str.count(pattern)</code> count the number of occurrences of the pattern in each row<br>
<code>df[col].str.split().str.len()</code> split each row and get the length of each resulting list
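
A quick sketch of these string methods on an invented Series:

```python
import pandas as pd

s = pd.Series(["  ann lee ", "bob", "cat de la"])

first_word = s.str.strip().str.split().str[0]   # first token of each string
n_words = s.str.split().str.len()               # number of tokens per row
n_a = s.str.count("a")                          # occurrences of "a" per row
padded = pd.Series(["7", "42"]).str.zfill(3)    # left-pad with zeros to width 3
```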

### Visualization
<code>df[col].plot(kind='hist', density=True, bins=20, ec='w')</code> draw a histogram normalized to a probability density<br>
<code>df.plot(kind='barh', x=..., y=...)</code> draw a horizontal bar plot<br>
<code>pd.plotting.scatter_matrix(df[['col1', 'col2', 'col3']], figsize=(8, 8));</code> draw a scatter matrix of df based on the selected columns<br>
<code>df.groupby(col1)[col2].plot(kind='kde', legend=True)</code> plot one estimated probability density curve of <code>col2</code> for each unique value of <code>col1</code>

# Numpy
<br>
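NumPy functions apply element-wise to a whole array, so no for loop is needed<br>
<code>df.to_numpy()</code> return the data as a 2D NumPy array; it may be a view of the original data rather than a copy (pass <code>copy=True</code> to force a copy)<br>
type coercion - an array holds a single dtype, so an array created from strings and ints contains only strings.<br>
<code>np.array(['a', 1])</code> gives <code>array(['a', '1'])</code> with a string dtype<br>
<code>np.array(['a', 1], dtype=object)</code> gives <code>array(['a', 1], dtype=object)</code><br>
<code>pd.Series(['a', 1]).values</code> gives <code>array(['a', 1], dtype=object)</code><br>
<code>pd.Series([1, 1.0]).values</code> gives <code>array([1., 1.])</code> with a float dtype<br>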

# Random
Generate a non-uniform random sample from np.arange(5) of size 3: <br>
<code>np.random.choice(5, p=[0.1, 0, 0.3, 0.6, 0], size=3) = array([3, 3, 0]) # random</code><br>
Flip a coin 20 times: <br>
<code>np.random.multinomial(20, [1/2., 1/2.], size=1) = array([[8, 12]])</code><br>

# Hypothesis Test
- if the two distributions are <b>categorical</b>, use the total variation distance (TVD)
- if the two distributions are <b>numerical</b>, use the difference in group means or medians

numerical<br>

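```python
# N simulated samples (N is set elsewhere) of 114 coin flips each, with P(heads) = 0.55
df = pd.DataFrame(np.random.choice(['H', 'T'], p=[0.55, 0.45], size=(N, 114)))
# Test statistic: proportion of heads in each sample (strings cannot be averaged directly)
result = (df == 'H').mean(axis=1)
```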

categorical<br>

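```python
# Simulate num_rep samples of size N and convert counts to proportions for the TVD
simulated = np.random.multinomial(N, [0.1, 0.4, 0.5], size=num_rep) / N
# TVD between each simulated distribution and `observed` (one value per row)
results = np.sum(np.abs(simulated - observed.to_numpy()), axis=1) / 2
# TVD between two categorical distributions (TVD only applies to categorical data)
tvd = np.sum(np.abs(serie1 - serie2)) / 2
# p-value: fraction of simulated TVDs at least as large as the TVD of the actual data
pval = (np.array(results) >= observed_tvd).mean()
```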
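
For a version that runs from start to finish, here is a sketch under stated assumptions: the null distribution `expected`, the observed proportions, the sample size and the number of repetitions are all made up, and every distribution (observed and simulated) is compared to the null distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

expected = np.array([0.1, 0.4, 0.5])      # distribution under the null (assumed)
observed = np.array([0.16, 0.38, 0.46])   # hypothetical observed proportions
N, num_rep = 200, 5000                    # hypothetical sample size and repetitions

def tvd(p, q):
    """Total variation distance between categorical distributions."""
    return np.abs(p - q).sum(axis=-1) / 2

observed_tvd = tvd(observed, expected)

# Draw num_rep samples of size N under the null and convert counts to proportions.
simulated = rng.multinomial(N, expected, size=num_rep) / N
simulated_tvds = tvd(simulated, expected)   # one TVD per simulated sample

# Fraction of simulated TVDs at least as extreme as the observed one.
p_value = (simulated_tvds >= observed_tvd).mean()
```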

# Permutation Test
<b>normal</b>
```python
import numpy as np

to_shuffle = smoking_and_birthweight.copy()
weights = to_shuffle['Birth Weight'].values

# Observed statistic: mean(smokers) - mean(non-smokers)
observed_difference = (
    smoking_and_birthweight
    .groupby('Maternal Smoker')['Birth Weight']
    .mean()
    .diff()
    .iloc[-1]
)

n_repetitions = 3000  # assumed value; any reasonably large number works
differences = []
for _ in range(n_repetitions):  # normal approach
    # Step 1: Shuffle the weights
    shuffled_weights = np.random.permutation(weights)
    # Step 2: Put them in a DataFrame
    to_shuffle['Shuffled Birth Weight'] = shuffled_weights
    # Step 3: Compute the test statistic
    group_means = (
        to_shuffle
        .groupby('Maternal Smoker')['Shuffled Birth Weight']
        .mean()
    )
    difference = group_means.diff().iloc[-1]
    # Step 4: Store the result
    differences.append(difference)

# Compare all simulated differences, not just the last one, to the observed value.
# Flip the comparison to <= if the alternative is that the difference is smaller.
pval = (np.array(differences) >= observed_difference).mean()
```
<b>faster</b>

```python
is_smoker = smoking_and_birthweight['Maternal Smoker'].values  # boolean array
weights = smoking_and_birthweight['Birth Weight'].values  # numeric array
n_smokers = is_smoker.sum()
n_non_smokers = len(is_smoker) - n_smokers  # the dataset has 1174 rows

# One shuffled copy of the labels per row: shape (3000, number of rows)
is_smoker_permutations = np.column_stack([
    np.random.permutation(is_smoker)
    for _ in range(3000)
]).T

# The boolean masks zero out the other group's weights before each row is summed
mean_smokers = (weights * is_smoker_permutations).sum(axis=1) / n_smokers
mean_non_smokers = (weights * ~is_smoker_permutations).sum(axis=1) / n_non_smokers
ultra_fast_differences = mean_smokers - mean_non_smokers
```

Use a <b>hypothesis test</b> when you have one observed sample and want to check it against a known population or model <br>
Use a <b>permutation test</b> when you have two observed samples and want to check whether they come from the same population

# Missingness
### Type of Missingness
<b>Missing by design (MD)</b> - whether a value is missing can be determined exactly from the other columns<br>
<b>Not missing at random (NMAR or NI)</b> - the chance that a value is missing <b>depends on the actual missing value</b>!<br>
<b>Missing at random (MAR)</b> - the chance that a value is missing depends on other columns, <b>not</b> on the missing value itself <br>
<b>Missing completely at random (MCAR)</b> - the chance that a value is missing depends neither on other columns nor on the value itself<br>
- use permutation tests to decide whether a column is MAR or MCAR.
    - Create two groups: one where values in a column are missing, and another where values in a column aren't missing.
    - To test the missingness of column X:
        - For every other column, test the null hypothesis "the distribution of (other column) is the same when column X is missing and when column X is not missing."
        - If you fail to reject the null, then column X's missingness does not depend on (other column).
        - If you reject the null, then column X is MAR dependent on (other column).
        - **If you fail to reject the null for all other columns, then column X is MCAR!**
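
The procedure above can be sketched for a single "other" column as follows. Everything here is an assumption made for illustration: the data is synthetic, `X` is made MAR on purpose, and the absolute difference in means is used as the test statistic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({"other": rng.normal(size=n), "X": rng.normal(size=n)})

# Make X MAR by construction: it goes missing mostly when `other` is large.
drop = (df["other"] > 0.5) & (rng.random(n) < 0.8)
df.loc[drop, "X"] = np.nan

x_missing = df["X"].isna().to_numpy()   # the two groups: X missing / X present
other = df["other"].to_numpy()

def diff_in_means(values, is_missing):
    """Absolute difference in the mean of `values` between the two groups."""
    return abs(values[is_missing].mean() - values[~is_missing].mean())

observed_stat = diff_in_means(other, x_missing)

# Null: `other` has the same distribution whether or not X is missing,
# so shuffling the missingness labels should not change the statistic much.
simulated = np.array([
    diff_in_means(other, rng.permutation(x_missing)) for _ in range(1000)
])
p_value = (simulated >= observed_stat).mean()
```

Because the missingness was built to depend on `other`, the p-value comes out very small here, which points to MAR dependent on `other`. To test for MCAR, the same test would be repeated for every other column.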

### Imputation
<b>listwise deletion</b> - dropping entire rows that contain missing values<br>
- procedure: <code>.dropna()</code>
- can delete perfectly good data in the other columns <br>
- better to drop rows with missing data only for the analyses that use the column containing the missing data <br>
- if the data are MCAR, dropping rows does not bias the statistics of the data (it only shrinks the sample)<br>

<b>mean imputation</b><br>
- procedure: <code>.fillna(df[col].mean())</code> <br>
- Preserves the mean of the observed data, for all types of missingness<br>
- Decreases the variance of the data, for all types of missingness<br>
- Creates a biased estimate of the true mean when the data are not MCAR<br>
- if the data are MCAR, the resulting mean is an unbiased estimate of the true mean<br>
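
A tiny sketch of the first two bullets on a made-up Series: the mean of the observed values is preserved, and the variance shrinks.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, np.nan, np.nan])

# Fill every missing value with the mean of the observed values.
imputed = s.fillna(s.mean())

mean_before, mean_after = s.mean(), imputed.mean()   # pandas skips NaN in s
var_before, var_after = s.var(), imputed.var()
```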

<b>conditional mean imputation</b><br>
- Since MAR data are MCAR within each group, perform group wise mean imputation
- e.g.
```python
def mean_impute(ser):
    return ser.fillna(ser.mean())
heights_mar_cat.groupby('gender')['child'].transform(mean_impute)
```
- tends to increase the correlations between columns
- if the data are MAR, the resulting mean is an unbiased estimate of the true mean, but the variance is still underestimated
- if the missingness depends on more than one column, use a model such as linear regression to predict the missing values
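
For the last bullet, here is a minimal sketch of regression imputation using ordinary least squares from NumPy. The data is synthetic and the column names `father`, `mother` and `child` are only illustrative; this is one possible approach, not necessarily the one intended in these notes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({"father": rng.normal(69, 2, n), "mother": rng.normal(64, 2, n)})
df["child"] = 20 + 0.4 * df["father"] + 0.3 * df["mother"] + rng.normal(0, 1, n)
df.loc[rng.choice(n, size=40, replace=False), "child"] = np.nan   # knock out 40 values

known = df["child"].notna().to_numpy()

# Design matrix: intercept plus the two predictor columns.
A = np.column_stack([np.ones(n), df["father"], df["mother"]])

# Fit on the rows where child is observed.
coef, *_ = np.linalg.lstsq(A[known], df["child"].to_numpy()[known], rcond=None)

# Predict only the missing rows; observed values are left untouched.
imputed = df["child"].copy()
imputed[~known] = A[~known] @ coef
```

Like mean imputation, this fills in "typical" values, so it still understates the variance of the column.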

<b>probabilistic imputation</b><br>
- fill in missing data by drawing from the distribution of the <b>non-missing</b> data.
- e.g.
```python
# Step 1: Count the missing values
num_null = heights_mcar['child'].isna().sum()
# Step 2: Sample that many values, with replacement, from the observed data
fill_values = heights_mcar['child'].dropna().sample(num_null, replace=True)
# Step 3: Re-label the sampled values with the index positions of the missing values
fill_values.index = heights_mcar.loc[heights_mcar['child'].isna()].index
# Step 4: Fill in the missing values
heights_mcar_dfilled = heights_mcar.fillna({'child': fill_values.to_dict()})
```
- a single imputation is random, so repeat it several times (multiple imputation) and combine the results to average out the randomness - similar to bootstrapping
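
A sketch of multiple imputation on a made-up Series; the helper `impute_once`, the data, and the choice of 100 repetitions are all hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
child = pd.Series([60.0, 62.0, np.nan, 65.0, np.nan, 70.0, 68.0, np.nan])

def impute_once(ser, rng):
    """Fill NaNs with values drawn (with replacement) from the observed values."""
    out = ser.copy()
    missing = out.isna()
    out[missing] = rng.choice(ser.dropna().to_numpy(), size=missing.sum(), replace=True)
    return out

# Repeat the imputation and average the statistic of interest over the runs.
means = np.array([impute_once(child, rng).mean() for _ in range(100)])
pooled_mean = means.mean()
```

The spread of `means` also gives a rough sense of how much the estimate depends on the random fill-ins.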

### The Kolmogorov-Smirnov test statistic
The K-S test statistic measures the similarity between two distributions -- it does not quantify whether one distribution is larger than the other on average<br>
The K-S statistic is roughly defined as the largest difference between two CDFs<br>
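
To make "largest difference between two CDFs" concrete, here is a NumPy-only sketch that computes the statistic from the empirical CDFs of two samples. The function name `ks_statistic` is made up; in practice a library routine such as `scipy.stats.ks_2samp` would be used.

```python
import numpy as np

def ks_statistic(a, b):
    """Largest absolute gap between the empirical CDFs of samples a and b."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])   # the empirical CDFs only change at data points
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)   # share of a <= each point
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)   # share of b <= each point
    return np.abs(cdf_a - cdf_b).max()

same = ks_statistic([1, 2, 3], [1, 2, 3])          # identical samples
disjoint = ks_statistic([1, 2, 3], [10, 11, 12])   # no overlap at all
partial = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
```

The statistic ranges from 0 (identical empirical distributions) to 1 (no overlap).
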
<b>when to use K-S</b><br>
If the distributions have similar shapes but are centered in different places, use the <b>difference in means</b> (or absolute difference in means) <br>
If your alternative hypothesis involves a "direction" (i.e. that one group's values are larger or smaller on average, as with smoking vs. non-smoking weights), use the <b>difference in means</b> <br>
If the distributions have different shapes and your alternative hypothesis is simply that the two distributions are different, use the <b>K-S statistic</b> <br>
e.g.<br>
H0: the missingness of 'child' is not dependent on 'father'.<br>
```python
from scipy import stats

heights_mcar['child_missing'] = heights_mcar['child'].isna()
# 'father' when 'child' is missing 
father_ch_mis = heights_mcar.loc[heights_mcar['child_missing'], 'father']
# 'father' when 'child' is not missing
father_ch_not_mis = heights_mcar.loc[~heights_mcar['child_missing'], 'father']
stats.ks_2samp(father_ch_mis, father_ch_not_mis)
```
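The result is <code>KstestResult(statistic=0.055674518201284794, pvalue=0.4645992385588452)</code><br>
The p-value is large, so we fail to reject the null hypothesis: the data are consistent with 'child' being MCAR with respect to 'father'. If the p-value were close to 0, we would reject the null and conclude that 'child' is MAR dependent on 'father'.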