# Data Wrangling

```python
import pandas as pd
import numpy as np
```

## Series

```python
# Series from lists
pd.Series(data=[ ... ],
          index=[...],
          name=''
         )

# Series from dicts 
pd.Series(data={
    ...
})
```
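As a minimal sketch of the two constructors above, the values, labels, and variable names below (`prices`, `stock`) are made up for illustration:

```python
import pandas as pd

# From a list, with explicit index labels and a name (illustrative values)
prices = pd.Series(data=[1.5, 2.0, 3.25], index=["a", "b", "c"], name="price")

# From a dict: the keys become the index labels
stock = pd.Series(data={"a": 10, "b": 0, "c": 7})

print(prices["b"])           # look up a value by label
print(stock.index.tolist())  # labels taken from the dict keys
```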

## Data Frames

### Creating Data Frames 

Data Frames can be made from: 
- list of lists
- ndarray
- dict
- list of tuples
- Series

```python
# Data Frames from list of lists
pd.DataFrame(data=[[ ... ],
                   [ ... ]
                  ],
             index=['row1', 'row2'],
             columns=['col1', 'col2']
            )

# Data Frames from dicts 
pd.DataFrame(data={
    'col1' : [...],
    'col2' : [...]
}, index=['row1', 'row2'])
```
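A small filled-in version of the templates above may help; the numbers and the names `df_rows` and `df_cols` are placeholders chosen for this sketch. Both constructors should produce the same frame:

```python
import pandas as pd

# Row-oriented input: each inner list is one row
df_rows = pd.DataFrame(
    data=[[1, 2], [3, 4]],
    index=["row1", "row2"],
    columns=["col1", "col2"],
)

# Column-oriented input: each dict key is one column
df_cols = pd.DataFrame(
    data={"col1": [1, 3], "col2": [2, 4]},
    index=["row1", "row2"],
)

print(df_rows.equals(df_cols))
```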

### Reading data

```python
# read file
pd.read_csv(path, 
            usecols=range(start, end),
            index_col=0,
            parse_dates=True
           )


# glimpse of df
df.head()
df.tail()

# data types of df
df.info() 
df.dtypes

# summary statistics
df.describe(include='all')

# organizing columns
df.rename(columns= {"old_col1" : "new_col1",
                    "old_col2" : "new_col2"})
df.columns = list_of_column_names

# convert types 
df.column.astype(type)

# set max df rows to view
pd.set_option("display.max_rows", n)

# conditional coloring
df.style.background_gradient(cmap='rainbow')
```
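To try these calls without a file on disk, one option is to read from an in-memory string. This is only a sketch: the CSV content, the column names, and the choice of `float32` are assumptions made for the example.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file path
csv_text = """date,city,temp
2021-01-01,Oslo,-3.5
2021-01-02,Oslo,-1.0
2021-01-03,Rome,12.0
"""

# First column becomes the index and is parsed as dates
df = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

# rename returns a new frame, so assign the result
df = df.rename(columns={"temp": "temp_c"})

# astype also returns a new object
df["temp_c"] = df["temp_c"].astype("float32")

print(df.head())
print(df.dtypes)
```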

### NA values

```python
# boolean condition
df.isna()

# show rows containing at least one NA
mask = df.isnull().any(axis=1)
df[mask]

# drop NA
df.dropna()

# fill NA with a constant
df.fillna(0)
# fill NA with column means (numeric columns only)
df.fillna(df.mean(numeric_only=True))
```
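The NA helpers can be seen together on a tiny frame. The data below is invented for the sketch, and `numeric_only=True` is used so the mean is well defined even if text columns are added later:

```python
import numpy as np
import pandas as pd

# Illustrative frame with one missing value per column
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

# True for every row that has at least one NA
mask = df.isna().any(axis=1)
rows_with_na = df[mask]

# Keep only complete rows
dropped = df.dropna()

# Replace each NA with the mean of its column
filled = df.fillna(df.mean(numeric_only=True))

print(rows_with_na)
print(filled)
```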

### Indexing and Slicing

| Method | Syntax |
|--------|--------|
| Select column/Series | `df[col_label]` | 
| Select multiple columns (DataFrame) | `df[[col_labels]]` | 
| Select row slice by position | `df[row_1_int:row_2_int]` | 
| Select row/column by label | `df.loc[row_label(s), col_label(s)]` |
| Select row/column by integer | `df.iloc[row_int(s), col_int(s)]` |
| Select by row integer & column label | `df.loc[df.index[row_int], col_label]` |
| Select by row label & column integer | `df.loc[row_label, df.columns[col_int]]` | 
| Select by boolean | `df[bool_vec]` | 
| Select by boolean expression | `df.query("expression")` | 

```python
# set column as index
df = df.set_index(col)

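# move the index back into a regular column
df = df.reset_index()

# sort rows by a column
df.sort_values(by='col', ascending=True)
# draw one random row
df.sample()
# column-wise minimum of numeric columns
df.min(axis=0, numeric_only=True)
# index label of the minimum in each column
df.idxmin()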
```
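The selection patterns in the table can be compared side by side on a small frame. This is a sketch with made-up data; the column names and variables are placeholders:

```python
import pandas as pd

# Illustrative frame indexed by name
df = pd.DataFrame(
    {"name": ["ann", "bob", "cy"], "age": [31, 25, 47], "score": [88.0, 92.5, 79.0]}
).set_index("name")

ages = df["age"]                      # single column, returned as a Series
subset = df[["age", "score"]]         # list of labels, returned as a DataFrame
first_two = df[0:2]                   # row slice by integer position
bob_score = df.loc["bob", "score"]    # row label and column label
cy_age = df.iloc[2, 0]                # row integer and column integer
mixed = df.loc[df.index[0], "score"]  # row integer translated to a label
older = df[df["age"] > 30]            # boolean vector
youngest = df["age"].idxmin()         # index label of the smallest age
```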

### Query

```python
df.query("column1 > condition1 & column2 > condition2")
df.query("column > @variable")
```
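A concrete version of the two query patterns, assuming a hypothetical frame with `height` and `weight` columns and a local variable `min_height`:

```python
import pandas as pd

# Illustrative data
df = pd.DataFrame({"height": [150, 172, 181, 165], "weight": [50, 80, 77, 90]})

# Combine two conditions inside the expression string
both = df.query("height > 160 & weight > 78")

# Reference a Python variable from the surrounding scope with @
min_height = 170
tall = df.query("height > @min_height")

print(both)
print(tall)
```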

### Data Manipulation

```python
# drop columns
df.drop(columns=list_of_col)

# drop rows
df.drop(df.index[x:], axis=0)

# pivot longer
df.melt(id_vars= , 
        value_vars= , 
        var_name=, 
        ignore_index=False)

# pivot wider
df.pivot(index= ,
         columns= ,
         values= )

# like pivot, but aggregates duplicate index/column pairs (mean by default)
df.pivot_table()

# append data frame vertically
pd.concat((df1, df2), axis=0)

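# append data frame horizontally
# (ignore_index=True relabels the columns 0..n-1; omit it to keep column names)
pd.concat((df1, df2), axis=1, ignore_index=True)

# join two data frames on a key column
# how is one of "inner", "outer", "left", "right"
pd.merge(df1, df2, how="inner", on="column")

# apply function column-wise (axis=0) or row-wise (axis=1)
# for functions that accept a Series; extra arguments go in args=(...) or as keywords
df.apply(function, axis=0)

# apply function element-wise
# for functions that accept single values
# (deprecated since pandas 2.1 in favor of df.map)
df.applymap(function)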
```
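The reshaping and combining calls above can be chained on a toy data set. This is a sketch only: the frames `wide` and `names`, their columns, and the `value_name="sales"` argument are assumptions made for the example.

```python
import pandas as pd

# Illustrative wide table: one column per quarter
wide = pd.DataFrame({"id": [1, 2], "q1": [10, 20], "q2": [30, 40]})

# Pivot longer: one row per (id, quarter) pair
long = wide.melt(
    id_vars="id", value_vars=["q1", "q2"], var_name="quarter", value_name="sales"
)

# Pivot wider: back to one column per quarter
wide_again = long.pivot(index="id", columns="quarter", values="sales")

# Append vertically and renumber the rows
stacked = pd.concat((wide, wide), axis=0, ignore_index=True)

# Inner join keeps only ids present in both frames
names = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
merged = pd.merge(wide, names, how="inner", on="id")

# Column-wise apply: the lambda receives each column as a Series
ranges = wide[["q1", "q2"]].apply(lambda col: col.max() - col.min(), axis=0)
```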

```python
# show groups as dict
df.groupby(by='col_name').groups

# slice by groups
df.groupby(by='col_name').get_group('group_value')

# summarize by group
df.groupby(by='col_name').mean(numeric_only=True)

# apply multiple functions
df.groupby(by='col_name').aggregate(['mean', 'sum', 'count'])

# apply different functions to different columns
df.groupby(by='col_name').aggregate({
    'column1' : ['function'],
    'column2' : ['function', ...]
})
```
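A short worked example of the grouping calls, using an invented frame with a `team` key and two numeric columns:

```python
import pandas as pd

# Illustrative data: two teams, two rows each
df = pd.DataFrame({
    "team": ["x", "x", "y", "y"],
    "points": [1, 3, 10, 20],
    "fouls": [2, 4, 1, 1],
})

grouped = df.groupby(by="team")

# Rows belonging to a single group
team_x = grouped.get_group("x")

# One summary value per group and column
means = grouped.mean(numeric_only=True)

# Different functions per column; the result has two-level column labels
summary = grouped.aggregate({"points": ["mean", "sum"], "fouls": ["max"]})

print(means)
print(summary)
```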

### Columns containing strings


Method | Description
:------|:------------
Series.str.cat | Concatenate strings
Series.str.split | Split strings on delimiter
Series.str.rsplit | Split strings on delimiter working from the end of the string
Series.str.get | Index into each element (retrieve i-th element)
Series.str.join | Join strings in each element of the Series with passed separator
Series.str.get_dummies | Split strings on the delimiter returning DataFrame of dummy variables
Series.str.contains | Return boolean array indicating whether each string contains pattern/regex
Series.str.replace | Replace occurrences of pattern/regex/string with some other string or the return value of a callable given the occurrence
Series.str.repeat | Duplicate values (s.str.repeat(3) is equivalent to x * 3 for each element x)
Series.str.pad | Add whitespace to left, right, or both sides of strings
Series.str.center | Equivalent to str.center
Series.str.ljust | Equivalent to str.ljust
Series.str.rjust | Equivalent to str.rjust
Series.str.zfill | Equivalent to str.zfill
Series.str.wrap | Split long strings into lines with length less than a given width
Series.str.slice | Slice each string in the Series
Series.str.slice_replace | Replace slice in each string with passed value
Series.str.count | Count occurrences of pattern
Series.str.startswith | Equivalent to str.startswith(pat) for each element
Series.str.endswith | Equivalent to str.endswith(pat) for each element
Series.str.findall | Compute list of all occurrences of pattern/regex for each string
Series.str.match | Call re.match on each element, returning a boolean
Series.str.extract | Call re.search on each element, returning DataFrame with one row for each element and one column for each regex capture group
Series.str.extractall | Call re.findall on each element, returning DataFrame with one row for each match and one column for each regex capture group
Series.str.len | Compute string lengths
Series.str.strip | Equivalent to str.strip
Series.str.rstrip | Equivalent to str.rstrip
Series.str.lstrip | Equivalent to str.lstrip
Series.str.partition | Equivalent to str.partition
Series.str.rpartition | Equivalent to str.rpartition
Series.str.lower | Equivalent to str.lower
Series.str.casefold | Equivalent to str.casefold
Series.str.upper | Equivalent to str.upper
Series.str.find | Equivalent to str.find
Series.str.rfind | Equivalent to str.rfind
Series.str.index | Equivalent to str.index
Series.str.rindex | Equivalent to str.rindex
Series.str.capitalize | Equivalent to str.capitalize
Series.str.swapcase | Equivalent to str.swapcase
Series.str.normalize | Return Unicode normal form. Equivalent to unicodedata.normalize
Series.str.translate | Equivalent to str.translate
Series.str.isalnum | Equivalent to str.isalnum
Series.str.isalpha | Equivalent to str.isalpha
Series.str.isdigit | Equivalent to str.isdigit
Series.str.isspace | Equivalent to str.isspace
Series.str.islower | Equivalent to str.islower
Series.str.isupper | Equivalent to str.isupper
Series.str.istitle | Equivalent to str.istitle
Series.str.isnumeric | Equivalent to str.isnumeric
Series.str.isdecimal | Equivalent to str.isdecimal
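
A few of the methods above chained together, as a sketch on made-up names. Missing values pass through the `.str` methods as NA, and `na=False` is one way to get a plain boolean result from `contains`:

```python
import pandas as pd

# Illustrative data with stray whitespace and a missing value
s = pd.Series(["  Ada Lovelace ", "alan turing", None])

# Methods chain because each .str call returns a new Series
clean = s.str.strip().str.lower()

# Split on a delimiter, then index into each resulting list
first = clean.str.split(" ").str.get(0)

# Boolean test; na=False treats missing values as "no match"
has_turing = clean.str.contains("turing", na=False)

# String lengths (missing values stay missing)
lengths = clean.str.len()
```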

### Regular Expressions

```python
import re
```

Method | Description
:------|:-----------
match() | Call re.match() on each element, returning a boolean.
extract() | Call re.search() on each element, returning the capture groups of the first match as strings.
findall() | Call re.findall() on each element
replace() | Replace occurrences of pattern with some other string
contains() | Call re.search() on each element, returning a boolean
count() | Count occurrences of pattern
split() | Equivalent to str.split(), but accepts regexps
rsplit() | Equivalent to str.rsplit(), but accepts regexps
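
The regex-aware methods are reached through the `.str` accessor. The sketch below uses an invented Series of order strings to show what each call returns:

```python
import pandas as pd

# Illustrative data
s = pd.Series(["order-123", "order-45 order-6", "none"])

# Boolean: does the pattern match at the start of each string?
starts = s.str.match(r"order-\d+")

# DataFrame with one column per capture group, first match only
nums = s.str.extract(r"(\d+)")

# List of every match per element
all_nums = s.str.findall(r"\d+")

# Number of matches per element
counts = s.str.count(r"\d+")

# Regex replacement needs regex=True
masked = s.str.replace(r"\d+", "#", regex=True)
```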

### Date Time 

```python
from datetime import datetime, timedelta

# create datetime from string
datetime.strptime('string', '%format %to %extract')
pd.to_datetime(string_of_dates, format='%date %format')

# extract time elements from date
d.strftime('%Y')  # year of datetime d, as a string
```
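With concrete format strings filled in, the calls above might look like the following sketch. The date strings and formats are assumptions for illustration, and the `.dt` accessor is shown as one way to pull components out of a datetime Series:

```python
from datetime import datetime, timedelta

import pandas as pd

# Parse a single string with the standard library
d = datetime.strptime("2021-03-15", "%Y-%m-%d")

# Format a component back out as a string
year = d.strftime("%Y")

# Date arithmetic with timedelta
next_week = d + timedelta(days=7)

# Parse many strings at once with an explicit day/month/year format
dates = pd.to_datetime(pd.Series(["15/03/2021", "16/03/2021"]), format="%d/%m/%Y")

# Extract a component from every element
months = dates.dt.month
```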

