<img src="https://drive.google.com/uc?id=1E_GYlzeV8zomWYNBpQk0i00XcZjhoy3S" width="100"/>

# DSGT Bootcamp Week 1: Introduction and Environment Setup

# Learning Objectives

1. Gain an understanding of Google Colab
2. Introduction to team project
3. Gain an understanding of Kaggle
4. Download and prepare dataset
5. Install dependencies
6. Gain an understanding of the basics of Python
7. Gain an understanding of the basics of GitHub / Git

# Google Colab

#### Google Colab is a cell-based Python text editor that allows you to run small snippets of code at a time. This is useful for data science: we can divide our work into manageable sections, then run and debug each section independently of the others.

#### Colab lets you store and access data from your Google Drive account (freeing up local storage) and lets you use Google's servers for computing (allowing you to parse bigger datasets).

#### Any given cell will either be a **code cell** or a **markdown cell** (formatted text, like this one). We won't need to focus on using Markdown cells, because you can just use comments in code cells to convey any text that might be important.

---

# Basic Commands:

#### All of these commands assume that you have a cell highlighted. To do so, simply click on any cell. 

#### `shift + enter (return)`:  Runs the current cell, and highlights the next cell

#### `alt (option) + enter (return)`: Runs the current cell, and creates a new cell

#### `Top Bar -> Runtime -> Run all`: Runs entire notebook

#### `+Code or +Text`: Adds a code or text cell below your highlighted cell


### For more information, check out the resources at the end!
---





<img src="https://www.kaggle.com/static/images/site-logo.png" alt="kaggle-logo-LOL"/>

# Introducing Kaggle



#### [Kaggle](https://kaggle.com) is an online 'practice tool' that helps you become a better data scientist. They have various data science challenges, tutorials, and resources to help you improve your skillset.


#### For this bootcamp, we'll be trying to predict trends using the Kaggle Titanic Data Set. This dataset models variables related to the passengers and victims of the Titanic sinking. By the end of this bootcamp, you'll submit your machine learning model to the leaderboards and see how well it performs compared to others worldwide.

#### For more information on Kaggle, check out the resources section.

# Accessing the Titanic Dataset

#### To speed up the data download process, we've placed the data in this Google Folder where everyone will be making their notebooks. Let's go over the steps needed to import data into Google Colab.

**PLEASE READ THE STEPS!!!**
1. Go to your Google Drive (drive.google.com) and check **"Shared with me"**
2. Search for a folder named **"Spring 2021 Bootcamp Material"**
3. Enter the **Spring 2021 Bootcamp Material** folder, click the name of the folder (**Spring 2021 Bootcamp Material**) on the bar at the top of the folder to create a drop-down and select **"Add shortcut to Drive"**
4. Select **"My Drive"** and hit **"Add Shortcut"**
5. Enter the **Spring 2021 Bootcamp Material** folder you just made, and navigate to the **"Participants"** subfolder
6. Make a new folder within Participants in the format **"FirstName LastName"**.
7. Return to Google Colab.
8. Go to **"File -> Save a copy in Drive"**. Rename the file to **"firstname-lastname-week1.ipynb"**. It will be placed into a folder named **"Colab Notebooks"** in your Google Drive.
9. Move **"firstname-lastname-week1.ipynb"** to your **Participant** folder within Google Drive.
10. Return to Google Colab.
11. Hit the folder image on the left bar to expand the file system.
12. Hit **"Mount Drive"** to allow Colab to access your files. Click the link and copy the code provided into the textbox and hit Enter.

In [None]:
# This cell should appear once you hit "Mount Drive". Press Shift + Enter to run it.

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
"""
You can use the following commands in Colab code cells. 
Type "%pwd" to list the folder you are currently in and "%ls" to list subfolders. Use "%cd [subfolder]"
to change your current directory into where the data is.
"""
%cd 'drive'

/content/drive


In [None]:
%ls

MyDrive/


In [None]:
# Move into a nested subfolder
# (relative path with no leading slash, because we are already inside /content/drive)
%cd 'MyDrive/Spring 2021 Bootcamp Material/Participants/Data'

/content/drive/.shortcut-targets-by-id/14ismWEVuvc7ESkob1ObgVgyn9LoOSa3h/Spring 2021 Bootcamp Material/Participants/Data


In [None]:
%pwd


'/content/drive/.shortcut-targets-by-id/14ismWEVuvc7ESkob1ObgVgyn9LoOSa3h/Spring 2021 Bootcamp Material/Participants/Data'

As you can see here, we've now located our runtime at "../Participants/Data", where the Titanic data file (titanic_test.csv) is located. For now, focus on understanding how to navigate the file system to reach your data.


**Note:** The above code cells could also have been a single command:

`%cd drive/MyDrive/Spring 2021 Bootcamp Material/Participants/Data`

We went one step at a time to show the process of exploring a file system you might not be familiar with. If you know the file path beforehand, you can move through multiple subfolders at once.
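
The same kind of navigation can be sketched in plain Python with the standard `os` module. This is only an illustration under stated assumptions: it builds a throwaway folder tree with `tempfile` (the folder names are hypothetical stand-ins for the shared Drive folders), so it runs anywhere and does not touch your Drive.

```python
import os
import tempfile

# Build a small throwaway folder tree so the example runs anywhere.
# "Participants/Data" is a hypothetical stand-in for the shared Drive folders.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "Participants", "Data"))

start_dir = os.getcwd()   # like %pwd: which folder am I in right now?
os.chdir(root)            # like %cd: move into a folder
print(os.listdir("."))    # like %ls: what is inside this folder?

# Move through two nested subfolders in a single step
os.chdir(os.path.join("Participants", "Data"))
current_dir = os.getcwd()
print(current_dir)

os.chdir(start_dir)       # go back to where we started
```

In Colab, the `%` commands are usually more convenient; the `os` functions are handy when you need to do the same thing inside a longer Python script.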

# Project Presentation
Link to Google Slides: [Slides](https://docs.google.com/presentation/d/1QzomRX5kpJTKuy9j2JFvCo0siBEtkPNvacur2ZAxCiI/edit?usp=sharing)


# Read Data with Pandas

#### `!pip install` adds libraries (things that add more functionality to Python) to this environment as opposed to your machine. 

#### We'll worry about importing and using these libraries later. For now, let's just make sure your environment has them installed.

#### Applied Data Science frequently uses core libraries to avoid "reinventing the wheel". One of these is pandas!

In [None]:
!pip install pandas
import pandas as pd #pd is the standard abbreviation for the library
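
If you want to confirm that a library is available without importing it, one possible approach is a small helper built on the standard `importlib` module. The helper name `is_installed` and the made-up library name below are assumptions for illustration only.

```python
import importlib.util

def is_installed(library_name):
    """Return True if Python can find the library in this environment."""
    return importlib.util.find_spec(library_name) is not None

# "json" ships with Python, so it should always be found
print(is_installed("json"))

# A made-up (hypothetical) name that should not be found
print(is_installed("a_library_that_does_not_exist"))
```

Calling `is_installed("pandas")` after the `!pip install pandas` cell should report that the library is present.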



#### Now that we're in the correct folder, we can use pandas to take a sneak peek at the data. Don't worry about these commands -- we'll cover them next week!

In [None]:
df = pd.read_csv("titanic_test.csv")
df.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


# Introduction to the Python Programming Language

### **Why do we use Python?**
- Easy to read and understand
- Lots of libraries for Data Science
- One of the most popular languages for Data Science (alongside R)


# Primer on Variables, If Statements, and Loops

In [None]:
#You can create a variable by using an "=" sign. The value on the right gets 
#assigned to the variable name on the left.

a = 5
b = 15
print(a + b)

c = "Data Science "
d = "is fun!"
print(c + d)

20
Data Science is fun!
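
Every variable holds a value of some type, and `+` only works between compatible types. The short sketch below (the `message` variable is just an illustrative name) shows how to check a type and how to convert a number to text before joining it to a string.

```python
a = 5
c = "Data Science "

print(type(a))  # <class 'int'>
print(type(c))  # <class 'str'>

# c + a would raise a TypeError, so convert the number to text first
message = c + "week " + str(a)
print(message)
```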


In [None]:
#If statements allow you to run certain lines of code based on certain conditions.

if (c + d) == "Data Science is fun!":
  print("Correct!")
else: # this section is only triggered if (c + d) doesn't equal "Data Science is fun!"
  print("False!")

Correct!
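
When there are more than two possible outcomes, `elif` lets you check additional conditions in order. This is a minimal sketch; the `score` value and the size labels are made up for illustration.

```python
score = 15  # hypothetical value

if score > 20:
    size = "large"
elif score > 10:  # only checked if the first condition was False
    size = "medium"
else:             # runs when none of the conditions above were True
    size = "small"

print(size)
```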


In [None]:
#For loops are used to perform an action a fixed number of times, or to go through each element in a list or string
for index in range(0, a):
  print('DSGT')

DSGT
DSGT
DSGT
DSGT
DSGT
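
It can help to see exactly which values `range` produces: it starts at the first number and stops just before the second. The sketch below uses illustrative variable names (`numbers`, `total`) to show this.

```python
# range(0, 5) produces 0, 1, 2, 3, 4 (the stop value 5 is not included)
numbers = list(range(0, 5))
print(numbers)

# Add up the numbers 1, 2, and 3 with a loop
total = 0
for number in range(1, 4):
    total = total + number
print(total)
```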


In [None]:
#In this block of code, c + d is treated as a sequence of characters. The variable
#"letter" holds one character at a time as the for loop iterates through the string.
for letter in c + d:
  print(letter)

D
a
t
a
 
S
c
i
e
n
c
e
 
i
s
 
f
u
n
!


# Lists, Tuples, and Dictionaries

In [None]:
# Let's start by creating a list (otherwise known as an array)
c = ["a", "b", "c"]

# We can retrieve an element by accessing its position in the array. 
# Position counting starts at 0 in Python.
print("The 1st item in the array is " + c[0])

# Lists can have more than one type of element!
c[1] = 23
print(c)

The 1st item in the array is a
['a', 23, 'c']
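
A few more list operations come up constantly when working with data. The following is a short sketch with an illustrative list named `letters`.

```python
letters = ["a", "b", "c"]

# Add an item to the end of the list
letters.append("d")

# len() tells you how many items the list holds
print(len(letters))

# Negative positions count from the end: -1 is the last item
print(letters[-1])

# A slice [start:stop] copies part of the list (stop is not included)
first_two = letters[0:2]
print(first_two)
```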


In [None]:
# Tuples are like lists, but they are immutable: they cannot be changed after creation
tup = ("car", True, 4)
tup[2] = 5 # causes an error

TypeError: 'tuple' object does not support item assignment
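
Since a tuple cannot be edited in place, the usual workaround is to build a new tuple. The sketch below also shows how `try` / `except` can catch the error so the rest of the code keeps running; `new_tup` is just an illustrative name.

```python
tup = ("car", True, 4)

try:
    tup[2] = 5  # not allowed: tuples cannot be modified
except TypeError as error:
    print("Could not change the tuple:", error)

# Build a new tuple from the first two items plus the new value
new_tup = tup[:2] + (5,)
print(new_tup)
```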

In [None]:
# Dictionaries store key-value pairs (since Python 3.7 they keep insertion order)
d = {"Data Science": "Fun", "GPA": 4, "Best Numbers": [3, 4]}

# We can get values by looking up their corresponding key
print(d["Data Science"])

# We can also reassign the value of keys
d["Best Numbers"] = [99, 100]

# And add keys
d["Birds are Real"] = False


#We can also print out all the key value pairs

print(d)

Fun
{'Data Science': 'Fun', 'GPA': 4, 'Best Numbers': [99, 100], 'Birds are Real': False}
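
Looking up a key that does not exist raises an error, so it is often safer to use `.get` with a fallback value. This sketch uses a smaller dictionary and a hypothetical `"Major"` key to show that, along with looping over all key-value pairs.

```python
d = {"Data Science": "Fun", "GPA": 4}

# .get returns the fallback ("Unknown") instead of raising an error
missing = d.get("Major", "Unknown")
print(missing)

# "in" checks whether a key exists
has_gpa = "GPA" in d
print(has_gpa)

# .items() lets a for loop visit each key together with its value
for key, value in d.items():
    print(key, "->", value)
```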


## Functions

In [None]:
# Functions help improve code reusability and readability. 
# You can define a series of steps and then use them multiple times.

def add(a, b):
  total = a + b  # named "total" to avoid hiding Python's built-in sum()
  return total

print(add(2, 4))
print(add(4, 7))
print(add(3 * 4, 6))

6
11
18
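
Functions can also give a parameter a default value, which is used whenever the caller leaves that argument out. The function name `add_with_default` and the default of 10 are assumptions chosen for illustration.

```python
def add_with_default(a, b=10):
    """Add two numbers; b falls back to 10 when it is not provided."""
    return a + b

print(add_with_default(2))     # uses the default b = 10
print(add_with_default(2, 4))  # overrides the default with b = 4
```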


**Note**: Data Science depends heavily on a solid foundation in Python. If you aren't familiar with it yet, we *highly recommend* spending some time learning (tutorials are available in the resources section). Otherwise, using the libraries and parsing data in an unfamiliar language may be rather difficult.

## **Introduction to Version Control and GitHub**
What is Version Control?
*  Tracks changes in computer files
*  Coordinates work between multiple developers
*  Allows you to revert back at any time
*  Can have local & remote repositories


What is GitHub?
*  Cloud-based Git repository hosting service
*  Allows users to use Git for version control
*  **Git** is a command line tool
*  **GitHub** is a web-based graphical user interface

# Set Up

If you do not already have Git on your computer, use the following link to install it:

[Install Git](https://git-scm.com/downloads)

**Setting Up a Repo**


*  $ git config

  *   $ git config --global user.name "YOUR_NAME"

  *   $ git config --global user.email "YOUR_EMAIL"


**Create a Repo**

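*   $ git init -- turns the current folder into a new Git repository

*   $ git clone [URL] -- copies an existing repository to your computer

**Note:** You can use https://github.gatech.edu, with YOUR_NAME = the username you log into GitHub with and YOUR_EMAIL = the email you log into GitHub with.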

**GitHub GUI**

One person in each team will create a "New Repository" on GitHub. Once they add team members to the repo, anyone on the team can clone the project to their local device using "$ git clone [URL]".

# Steps for Using Git


1.   Check that you are up to date with Remote Repo -- **git fetch**
  *   check status -- **git status** 
  *   if not up to date, pull down changes -- **git pull**

2.   Make changes to code
3.   Add all changes to the "stage" -- **git add .**
4.   Commit any changes you want to make -- **git commit -m [message]**
5.   Update the Remote Repo with your changes -- **git push**


**Summary**

3 stage process for making commits (after you have made a change):


1.   ADD
2.   COMMIT
3.   PUSH
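
To tie this back to Python, the three-stage process can be sketched as a small helper that lists the Git commands in order. This is a hypothetical convenience function (`commit_and_push` is not part of Git or any library). By default it only returns the commands without running them; passing `dry_run=False` assumes Git is installed and that you are inside a cloned repository.

```python
import subprocess

def commit_and_push(message, dry_run=True):
    """Stage, commit, and push changes.

    With dry_run=True (the default), the commands are only returned, not run.
    """
    commands = [
        ["git", "add", "."],               # 1. ADD all changes to the stage
        ["git", "commit", "-m", message],  # 2. COMMIT them with a message
        ["git", "push"],                   # 3. PUSH them to the remote repo
    ]
    if not dry_run:
        for command in commands:
            # check=True stops at the first command that fails
            subprocess.run(command, check=True)
    return commands

# Show what would be run, without touching any repository
for command in commit_and_push("Add week 1 notebook"):
    print(" ".join(command))
```

Most people type these commands directly in a terminal; the helper simply makes the order of the steps explicit.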


# Branching

By default, when you create your project you will be on the default branch (`master`, or `main` in newer repositories). However, it is good practice to have different branches for different features, people, etc.

* To see all local branches --  **git branch** 

* To create a branch -- **git branch [BRANCHNAME]**

* To move to a branch -- **git checkout [BRANCHNAME]**

* To create a new branch **and** move to it -- **git checkout -b [BRANCHNAME]**

# Merging
Merging allows you to carry the changes in one branch over to another branch. GitHub is your best friend for this: you can open and resolve merge conflicts through the GUI very easily. However, it is also good to know how to merge manually, in case you are unable to resolve conflicts through the GUI.

**Manual Steps**
1.   $ git checkout [NAME_OF_BRANCH_TO_MERGE_INTO]

2.   $ git merge [NAME_OF_BRANCH_TO_BRING_IN]



# Helpful Resources

#### [Colab Overview](https://colab.research.google.com/notebooks/basic_features_overview.ipynb)
#### [Kaggle Courses](https://www.kaggle.com/learn/overview)
#### [Kaggle](https://www.kaggle.com/)
#### [Intro Python](https://pythonprogramming.net/introduction-learn-python-3-tutorials/)
#### [Pandas Documentation](https://pandas.pydata.org/docs/)
#### [Pandas Cheatsheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
#### [GitHub Tutorial](https://guides.github.com/activities/hello-world/)
