# Week 1

## Week 1 Lesson 1
## Welcome + course overview
1 - 23 May 2017

### Course overview
* First, build a basic grounding in each area of data science: theory, programming, statistics, visualisation, communication, and subject matter expertise.
* Then go into further detail in more advanced areas.

### Additional readings
* Python for Data Analysis:  http://shop.oreilly.com/product/0636920023784.do
* Command Line Crash Course: http://cli.learncodethehardway.org/book/

---

## Week 1 Lesson 2
## Git + GitHub, NumPy + Pandas
2 - 25 May 2017

### Git
* Version control system that allows you to track files and file changes in a repository (“repo”)
* Primarily used by software developers
* Most widely used version control system
* Alternatives: Mercurial, Subversion, CVS
* Runs from the command line (usually)
* Can be used alone or in a team

### GitHub
* Allows you to put your Git repos online
* Alternative: Bitbucket
* Benefits of GitHub:
  * Backup of files
  * Visual interface for navigating repos
  * Makes repo collaboration easy
* Git does not require GitHub

<img src=img/git_chart.png width=500 align=left>

### NumPy

* NumPy provides general mathematical functions and an array type for storing and operating on data.

In [None]:
import numpy as np

In [None]:
a = np.array( [20,30,40,50] )
a

In [None]:
b = np.arange(4)
b

In [None]:
c = a-b
c

In [None]:
d = b*2
d

In [None]:
e = np.random.randint(1,10,(2,3,4))
e

In [None]:
e[:1]
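
The cells above can be condensed into one short sketch showing that arithmetic on arrays is element-wise and that slicing a 3-D array keeps its leading axis. The seeded generator `rng` is an assumption added here so the random values are repeatable; it is not part of the original cells.

```python
import numpy as np

a = np.array([20, 30, 40, 50])
b = np.arange(4)                 # array([0, 1, 2, 3])

diff = a - b                     # element-wise subtraction
doubled = b * 2                  # every element multiplied by the scalar 2

# A 3-D array of random integers in [1, 10): 2 blocks, each 3 rows by 4 columns.
rng = np.random.default_rng(0)   # seeded generator (assumption, for repeatability)
e = rng.integers(1, 10, size=(2, 3, 4))
first_block = e[:1]              # slicing keeps the leading axis

print(diff)                          # [20 29 38 47]
print(doubled)                       # [0 2 4 6]
print(e.shape, first_block.shape)    # (2, 3, 4) (1, 3, 4)
```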

### Pandas
* Pandas is built for reading and working with tabular data. It can read files in several formats, including CSV, from local paths or URLs.

In [None]:
import pandas as pd

#### Importing data

In [None]:
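pd.read_table('../data/u.user', header=None)   # default separator is a tab, so each pipe-delimited line lands in one column
user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
# pass sep='|' to split the fields, name the columns, and keep zip codes as strings
users = pd.read_table('../data/u.user', sep='|', header=None, names=user_cols, index_col='user_id', dtype={'zip_code':str})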
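
If the data file is not to hand, the same options can be tried on an in-memory string. This is a minimal sketch: the three rows in `raw` are hypothetical, made up to mimic the pipe-delimited layout, and `pd.read_csv` with `sep='|'` stands in for `pd.read_table`.

```python
import io
import pandas as pd

# Hypothetical sample rows in a pipe-delimited layout (not taken from the real file).
raw = """1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|02139
"""

user_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
sample = pd.read_csv(
    io.StringIO(raw),
    sep='|',
    header=None,               # the data has no header row
    names=user_cols,           # so supply the column names
    index_col='user_id',
    dtype={'zip_code': str},   # keep zip codes as text so leading zeros survive
)

print(sample.shape)                # (3, 4)
print(sample.loc[3, 'zip_code'])   # 02139
```

Reading `zip_code` as a string matters because a numeric parse would turn `02139` into `2139`.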

#### Exploring the data

In [None]:
users                   # display the DataFrame (truncated to the first and last rows when it is long)
type(users)             # DataFrame
users.head()            # print the first 5 rows
users.head(10)          # print the first 10 rows
users.tail()            # print the last 5 rows
users.index             # "the index" (aka "the labels")
users.columns           # column names (which is "an index")
users.dtypes            # data types of each column
users.shape             # number of rows and columns
users.values            # underlying numpy array
users.info()            # concise summary (including memory usage)

In [None]:
#users['gender']         # select one column
#type(users['gender'])   # Series
#users[['gender']]
#type(users[['gender']])   # DataFrame
#users.gender            # select one column using the DataFrame attribute
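
The difference between single and double brackets is easiest to see on a tiny example. The sketch below uses a hypothetical three-row DataFrame called `demo` rather than the course data.

```python
import pandas as pd

# Hypothetical stand-in for the users table.
demo = pd.DataFrame({'age': [24, 53, 23], 'gender': ['M', 'F', 'M']})

one_col = demo['gender']        # single brackets with a name -> Series
one_col_df = demo[['gender']]   # a list of names -> DataFrame with one column

print(type(one_col).__name__)            # Series
print(type(one_col_df).__name__)         # DataFrame
print(one_col.shape, one_col_df.shape)   # (3,) (3, 1)
```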

In [None]:
#users.describe()                    # describe all numeric columns
#users.describe(include=['object'])  # describe all object columns (can include multiple types)
#users.describe(include='all')       # describe all columns
#users.gender.describe()             # describe a single column
users.age.mean()                    # only calculate the mean

In [None]:
#users.occupation.value_counts()     # most useful for categorical variables
users.age.value_counts()        # can also be used with numeric variables
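
As a small illustration of `value_counts()` and `mean()`, here is a sketch on a hypothetical four-row DataFrame (`demo` and its values are invented for the example).

```python
import pandas as pd

# Hypothetical data, not the course dataset.
demo = pd.DataFrame({
    'age': [24, 53, 23, 24],
    'occupation': ['student', 'doctor', 'student', 'student'],
})

occupation_counts = demo.occupation.value_counts()   # counts per category, most frequent first
age_counts = demo.age.value_counts()                 # also works on numeric columns
mean_age = demo.age.mean()

print(occupation_counts.index[0], occupation_counts.iloc[0])   # student 3
print(age_counts.loc[24])                                      # 2
print(mean_age)                                                # 31.0
```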

#### Filtering

In [None]:
#young_bool = users.age < 20         # create a Series of booleans...
#users[young_bool]                   # ...and use that Series to filter rows
#users[users.age < 20]               # or, combine into a single step
#users[users.age < 20].occupation    # select one column from the filtered results
users[users.age < 20].occupation.value_counts()     # value_counts of resulting Series

In [None]:
# logical filtering with multiple conditions
#users[(users.age < 20) & (users.gender=='M')]       # ampersand for AND condition
#users[(users.age < 20) | (users.age > 60)]          # pipe for OR condition
users[users.occupation.isin(['doctor', 'lawyer'])]  # alternative to multiple OR conditions
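
The filtering patterns above all rely on a boolean Series used as a row mask. A minimal sketch on a hypothetical four-row DataFrame (`demo` is invented for illustration):

```python
import pandas as pd

# Hypothetical data, not the course dataset.
demo = pd.DataFrame({
    'age': [17, 45, 19, 63],
    'gender': ['M', 'F', 'F', 'M'],
    'occupation': ['student', 'doctor', 'student', 'lawyer'],
})

young = demo.age < 20                                     # boolean Series, one value per row
young_men = demo[young & (demo.gender == 'M')]            # AND: both conditions must hold
young_or_old = demo[(demo.age < 20) | (demo.age > 60)]    # OR: either condition may hold
pros = demo[demo.occupation.isin(['doctor', 'lawyer'])]   # membership test instead of chained ORs

print(young.tolist())                                 # [True, False, True, False]
print(len(young_men), len(young_or_old), len(pros))   # 1 3 2
```

The parentheses around each condition are required because `&` and `|` bind more tightly than the comparison operators.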

#### Sorting

In [None]:
users.age.sort_values()                   # sort a column
users.sort_values(by='age')                   # sort a DataFrame by a single column
users.sort_values(by='age', ascending=False)  # use descending order instead
users.sort_values(by=['occupation', 'age'])   # sort by multiple columns
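
A short sketch of sorting by several columns, using a hypothetical DataFrame named `demo`. Note that `sort_values` returns a new object and leaves the original order untouched.

```python
import pandas as pd

# Hypothetical data, not the course dataset.
demo = pd.DataFrame({
    'age': [30, 25, 30, 22],
    'occupation': ['writer', 'doctor', 'doctor', 'writer'],
})

by_age_desc = demo.sort_values(by='age', ascending=False)
by_occ_age = demo.sort_values(by=['occupation', 'age'])   # occupation first, then age within it

print(by_age_desc.age.tolist())   # [30, 30, 25, 22]
print(by_occ_age.age.tolist())    # [25, 30, 22, 30]
print(demo.age.tolist())          # [30, 25, 30, 22] (original is unchanged)
```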

#### Grouping

In [None]:
byage = users.groupby('age')
byage.describe()
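
Besides `describe()`, a grouped object can compute one statistic per group. The sketch below groups a hypothetical DataFrame (`demo`) by occupation rather than age, purely to keep the output small.

```python
import pandas as pd

# Hypothetical data, not the course dataset.
demo = pd.DataFrame({
    'occupation': ['student', 'doctor', 'student', 'doctor'],
    'age': [18, 40, 22, 50],
})

by_occupation = demo.groupby('occupation')
mean_age = by_occupation.age.mean()   # one mean per occupation
sizes = by_occupation.size()          # number of rows in each group

print(mean_age['doctor'], mean_age['student'])   # 45.0 20.0
print(sizes['doctor'], sizes['student'])         # 2 2
```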

#### Missing values
scikit-learn models expect all values to be numeric and meaningful, so most estimators do not accept missing values. One possible strategy is to find the missing values and drop them:

In [None]:
users.age.value_counts()              # excludes missing values
users.age.value_counts(dropna=False)  # includes missing values

In [None]:
users.age.isnull()           # True if missing, False if not missing
#users.age.isnull().sum()     # count the missing values
#users.age.notnull()          # True if not missing, False if missing
#users[users.age.notnull()]  # only show rows where age is not missing

In [None]:
# use 'tilde' ~ to negate the boolean values
~users.age.isnull()  

In [None]:
users.isnull()             # DataFrame of booleans
users.isnull().sum()       # count the missing values in each column

Another way is to impute missing values:

In [None]:
users.age.fillna(value='NA')                   # returns a copy with missing values replaced by the string 'NA'
#users['age'] = users.age.fillna(value='NA')   # assign back to modify 'users' (the column then holds strings)
# fill any null values of age with the mean age for the same occupation and gender
users['age'] = users.groupby(['occupation', 'gender'])['age'].transform(lambda x: x.fillna(x.mean()))
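
The two strategies, dropping and imputing, can be compared on a small example. This is a sketch under stated assumptions: `demo` is a hypothetical DataFrame with two missing ages, and the group means are computed with `transform('mean')`, which gives the same fill values as the lambda form.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing ages, not the course dataset.
demo = pd.DataFrame({
    'occupation': ['student', 'student', 'student', 'doctor', 'doctor'],
    'gender': ['M', 'M', 'M', 'F', 'F'],
    'age': [18.0, np.nan, 22.0, 40.0, np.nan],
})

n_missing = demo.age.isnull().sum()   # count the missing values
dropped = demo[demo.age.notnull()]    # strategy 1: keep only rows with a known age

# Strategy 2: fill each gap with the mean age of the same occupation and gender.
group_mean = demo.groupby(['occupation', 'gender'])['age'].transform('mean')
filled = demo.age.fillna(group_mean)

print(n_missing, len(dropped))   # 2 3
print(filled.tolist())           # [18.0, 20.0, 22.0, 40.0, 40.0]
```

Dropping loses two of the five rows here, while imputing keeps every row at the cost of inventing plausible values.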

In [None]:
users.sum(axis=0, numeric_only=True)   # sums "down" the 0 axis (rows), numeric columns only
users.sum(numeric_only=True)           # axis=0 is the default
users.sum(axis=1, numeric_only=True)   # sums "across" the 1 axis (columns)
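
The axis argument is easier to follow on an all-numeric table. A minimal sketch with a hypothetical two-column DataFrame (`demo`):

```python
import pandas as pd

# Hypothetical all-numeric data.
demo = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

col_totals = demo.sum(axis=0)   # collapses the rows: one total per column
row_totals = demo.sum(axis=1)   # collapses the columns: one total per row

print(col_totals.tolist())   # [6, 60]
print(row_totals.tolist())   # [11, 22, 33]
```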

### Additional readings
[Basic Git commands list](https://confluence.atlassian.com/bitbucketserver/basic-git-commands-776639767.html)

[Good resources](https://help.github.com/articles/git-and-github-learning-resources/)