# Preparing for the Capital One Technical Interview

## Technical & Design Interview Guidelines

For the **Technical** Interviews:

  -  In the Hands-On/Coding Technical Interview, the primary focus will be walking through a code test or code review with your technical interviewer in your language of choice (this will include small functions, usage, algorithmic knowledge, etc.).

       - This session will be about building a data pipeline (preferred languages for building the pipeline are Python, Java, or Scala, but technically you could use anything besides SQL for this part).

       - Skills tested: obtaining data, scrubbing and cleaning it, and loading it. Once the data is loaded, you will need to demonstrate querying skills (you can use SQL for querying).

       - Be prepared to solve coding problems and to discuss the reasoning behind your solution; dig deep for the interviewer.

  -  In the **Design** focused Technical Interview, the session centers on a specific working problem:

       - Designing a data pipeline

       - Items to think about: database design concepts, schemas, data pipeline design, design time vs. run time of the stack, designing data engineering solutions at scale, etc.

       - Be prepared to utilize the whiteboarding feature in Zoom

       - Use these [System Design Primer/Topics](https://github.com/kvasukib/system-design-primer#system-design-topics-start-here) to help you prepare

  -  Some general things to also think about:

       - Core programming skills, design philosophy, risk factors, coding standards, etc.

       - Data structures, Object oriented programming & Code optimization.

       - System Design and common architectural patterns

       - API Design & Data Modeling

       - Design Tradeoffs & Performance tuning

       - What motivates you in technology, specific languages, etc.?

       - You should expect some technical questions from the interviewers related to your technical background

       - What do you see as some exciting things you may be able to bring to the table at Capital One?
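
The pipeline skills listed for the hands-on interview (obtain, clean, load, then query with SQL) can be practiced end to end with nothing but the Python standard library. The following is a minimal sketch, not the interview's actual exercise: the sample rows, the `readings` table, and the function names `extract`, `clean`, and `load` are all made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; a real pipeline would read from a file or an API.
RAW_CSV = """station,year,month,temp
Changping,2013,3,-0.7
Changping,2013,3,-0.7
Changping,2013,4,
Changping,2013,5,21.5
"""


def extract(text):
    """Obtain: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Scrub: drop exact duplicates and rows without a temperature, then cast types."""
    seen = set()
    cleaned = []
    for row in rows:
        key = tuple(row.values())
        if key in seen or row["temp"] == "":
            continue
        seen.add(key)
        cleaned.append(
            (row["station"], int(row["year"]), int(row["month"]), float(row["temp"]))
        )
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings "
        "(station TEXT, year INTEGER, month INTEGER, temp REAL)"
    )
    conn.executemany("INSERT INTO readings VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(clean(extract(RAW_CSV)), conn)

# Query: average temperature per month, using plain SQL.
monthly_avg = conn.execute(
    "SELECT month, AVG(temp) FROM readings GROUP BY month ORDER BY month"
).fetchall()
print(monthly_avg)
```

Keeping each stage in its own function makes it easier to explain your reasoning step by step and to test the stages independently.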

## Pandas

In [None]:
from datetime import datetime
import pandas as pd

DATA_FILEPATH = r"../../data/beijing_airquality/PRSA_Data_Changping_20130301-20170228.csv"

# data = pd.read_csv(DATA_FILEPATH, chunksize=10000, header=0, index_col='No', on_bad_lines='warn')
df = pd.read_csv(DATA_FILEPATH, header=0, index_col='No', on_bad_lines='warn')

# display(df)
# df.info()
# df.describe(include='all')
# rows per year (printed explicitly, since only a cell's last expression is echoed)
print(df['year'].value_counts(sort=False).to_dict())

display(df)


#### De-duping

In [None]:
print(f"rows before de-dup: {len(df.index)}")
print("deduping...")
# one row per (year, month, day, hour); keep the last occurrence and reset the index
df.drop_duplicates(subset=['year', 'month', 'day', 'hour'], inplace=True, ignore_index=True, keep='last')
print(f"rows after de-dup: {len(df.index)}")
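
Before dropping duplicates, it can help to look at the rows that would be affected and to confirm what `keep='last'` retains. The snippet below is a small sketch on made-up data (the `toy` frame and its values are hypothetical), not part of the air-quality dataset.

```python
import pandas as pd

# Hypothetical toy data: two readings share the same (year, month, day, hour) key.
toy = pd.DataFrame({
    "year": [2013, 2013, 2013],
    "month": [3, 3, 3],
    "day": [1, 1, 1],
    "hour": [0, 0, 1],
    "temp": [-0.7, -0.9, -1.1],
})
key = ["year", "month", "day", "hour"]

# keep=False marks every member of a duplicate group, so all of them can be inspected.
dupes = toy[toy.duplicated(subset=key, keep=False)]
print(dupes)

# keep='last' retains the final occurrence within each duplicate group.
deduped = toy.drop_duplicates(subset=key, keep="last", ignore_index=True)
print(deduped["temp"].tolist())
```

Whether the first or the last occurrence is the right one to keep depends on how the source produces duplicates, so it is worth stating that assumption out loud in an interview.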


#### Basic transforms

In [None]:
# dropping columns
df = df.drop(columns=['PM2.5', 'SO2', 'NO2', 'O3'], errors='ignore')
# change column names to lower case
df = df.rename(columns=str.lower)
# check columns
necessary_columns = ('year', 'month', 'day', 'hour', 'temp', 'pres', 'dewp', 'rain', 'wd', 'wspm', 'station')
missing_columns = [col for col in necessary_columns if col not in df.columns]
assert not missing_columns, f"Missing schema columns: {missing_columns}"
# use efficient data types
print("data types before cast:")
print(df.dtypes)
# date parts are small non-negative integers
for col in ('year', 'month', 'day', 'hour'):
    df[col] = pd.to_numeric(df[col], downcast='unsigned')
# measurements are floats; downcast to the smallest float dtype that fits
for col in ('pm10', 'co', 'temp', 'pres', 'dewp', 'rain', 'wspm'):
    df[col] = pd.to_numeric(df[col], downcast='float')
df['wd'] = df['wd'].astype('category')
df['station'] = df['station'].astype('category')
print("data types after cast:")
print(df.dtypes)
# build a timestamp column from the date parts (vectorized, avoids a slow row-wise apply)
df['mdate'] = pd.to_datetime(df[['year', 'month', 'day', 'hour']])
display(df)
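
To see what the downcasting above buys, memory usage can be compared before and after. This is a minimal sketch on a made-up frame (`toy` and its values are hypothetical); the exact savings on the real dataset will differ.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with pandas' default 64-bit dtypes.
toy = pd.DataFrame({
    "hour": np.arange(24, dtype="int64"),
    "temp": np.linspace(-5.0, 20.0, 24),
})
before = toy.memory_usage(deep=True).sum()

# downcast picks the smallest dtype of the requested kind that can hold the values.
toy["hour"] = pd.to_numeric(toy["hour"], downcast="unsigned")  # int64 -> uint8
toy["temp"] = pd.to_numeric(toy["temp"], downcast="float")     # float64 -> float32
after = toy.memory_usage(deep=True).sum()

print(toy.dtypes)
print(f"bytes before: {before}, after: {after}")
```

The trade-off is precision: `float32` carries roughly seven significant decimal digits, which is usually enough for sensor readings but may not be for values that are later summed over many rows.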

#### Detecting and Handling nulls

In [None]:
df.isnull().sum()
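
Once nulls are counted, they still need to be handled. The sketch below shows a few common options on made-up data (the `toy` frame is hypothetical); which one is appropriate for the air-quality columns is a judgment call that depends on how the gaps arose.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a gap in a numeric and in a categorical column.
toy = pd.DataFrame({
    "temp": [1.0, np.nan, 3.0, 4.0],
    "wd": ["N", None, "NE", "N"],
})

# Detect: number of nulls per column.
null_counts = toy.isnull().sum()
print(null_counts.to_dict())

# Option 1: drop rows that are missing a required value.
dropped = toy.dropna(subset=["temp"])

# Option 2: fill the gaps instead of dropping rows.
#   - numeric column: linear interpolation between neighbouring readings
#   - categorical column: the most frequent value (the mode)
filled = toy.assign(
    temp=toy["temp"].interpolate(),
    wd=toy["wd"].fillna(toy["wd"].mode().iloc[0]),
)
print(filled)
```

Interpolation assumes the rows are ordered in time and that neighbouring readings are related, so it should be applied after sorting by timestamp.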