# Tutorial: Hello Bash and Python

In this tutorial we will familiarise ourselves with Bash and Python, and (inadvertently) with notebooks. Please load this notebook in `colab.research.google.com` if you do not have a local instance of JupyterHub/JupyterLab running.


Submission:

This tutorial must be submitted on Git as well as on SUNLearn. You will receive an email at your student account asking you to create an account on GitLab.




In [1]:
import pandas as pd

## Question 1: Bash

Retrieve data and interrogate it with bash before using Python tooling. This is useful because you may struggle with data types or malformed files, and a quick interrogation may reveal those issues.


### Question 1.1

Run the bash command `wget` to retrieve a file located at `https://storage.googleapis.com/bdt-beam/users_v.csv` [1]

In [2]:
!wget -q 'https://storage.googleapis.com/bdt-beam/users_v.csv' -O 'users_v.csv'

### Question 1.2

Use a bash command to view the first ten lines of the file (to confirm that things are as you expect) [1]

In [3]:
! head users_v.csv

user_id,name,gender,age,address,date_joined
1,Anthony Wolf,male,73,New Rachelburgh-VA-49583,2019/03/13
2,James Armstrong,male,56,North Jillianfort-UT-86454,2020/11/06
3,Cody Shaw,male,75,North Anne-SC-53799,2004/05/29
4,Sierra Hamilton,female,76,New Angelafurt-ME-46190,2005/08/26
5,Chase Davis,male,31,South Bethmouth-WI-18562,2018/04/30
6,Sierra Andrews,female,21,Ryanville-MI-69690,2007/05/25
7,Ann Stone,female,41,Smithmouth-SD-17340,2005/01/05
8,Karen Santos,female,34,Mariaville-AK-29888,2003/12/12
9,Ronald Meyer,male,41,North Cherylhaven-NJ-04197,2015/11/14


### Question 1.3

Use a bash command to determine the number of rows in the file [1]

In [4]:
!wc -l < users_v.csv

2358


### Question 1.4

Suppose the file is too large for initial exploration. Let's take a quick sample so that we can keep working and see what is in the data set. Loading the whole file into Pandas just to sample it would still use all that memory, so let's sample it before we load it.

Take a random sample of the file (limit the result to 1000 lines) and write it to another file called `users_sample.csv`, using only bash commands [3]

Hint: redirect a stream into the output file.

In [5]:
!shuf -n 10 users_v.csv > users_sample.csv
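
As a point of comparison, the sample-before-loading idea can be sketched in Python with reservoir sampling, which reads the file one line at a time and never holds more than `k` data lines in memory. This is only an illustration, not the bash answer the question asks for; the function name `sample_lines`, the `seed` parameter, and the in-memory demo data are assumptions made for the example. Unlike a plain `shuf`, this sketch sets the header line aside so that it is never sampled as a data row.

```python
import io
import random


def sample_lines(stream, k, seed=None):
    """Return the header line plus up to k randomly chosen data lines."""
    rng = random.Random(seed)
    header = next(stream)  # keep the CSV header out of the sample
    reservoir = []
    for i, line in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k data lines.
            reservoir.append(line)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = line
    return header, reservoir


# Hypothetical in-memory stand-in for users_v.csv (100 data rows).
demo = io.StringIO(
    "user_id,name\n" + "".join(f"{i},user{i}\n" for i in range(1, 101))
)
header, rows = sample_lines(demo, k=10, seed=0)
print(header + "".join(rows), end="")
```

To sample the real file, pass `open('users_v.csv')` instead of the `StringIO` object.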

### Question 1.5 

* Sort your file by ascending `user_id`s [1]
* Overwrite the current `users_sample.csv` with the ordered content [1]
* Print the last ten lines of this file [1]

In [7]:
! sort -n users_sample.csv

17,Mikayla Jacobson,female,67,Harrisonhaven-ID-71589,2005/02/23
854,Jill Bell,female,72,Port Emily-MS-14625,2007/06/03
1042,Heather Barnes,female,70,Bakerhaven-MI-36460,2019/03/01
1265,Melissa Cordova,female,43,Stewartshire-IA-12779,2015/11/08
1272,Connor Palmer,male,26,Nicolefurt-WI-72889,2000/11/03
1390,Julia Vaughan,female,65,Franklinshire-GA-30754,2002/07/22
1728,Jennifer Casey,female,70,Theresabury-TN-05417,2013/03/01
1976,William Morgan,male,27,Coxhaven-HI-65271,2009/01/30
1993,Donna Cooper,female,41,Port Levi-MD-67124,2011/12/28
2301,Willie Villegas,male,40,East Drewchester-FL-95044,2007/08/03


In [8]:
! sort -n users_sample.csv > users_sample_ordered.csv

In [11]:
!tail users_sample_ordered.csv

17,Mikayla Jacobson,female,67,Harrisonhaven-ID-71589,2005/02/23
854,Jill Bell,female,72,Port Emily-MS-14625,2007/06/03
1042,Heather Barnes,female,70,Bakerhaven-MI-36460,2019/03/01
1265,Melissa Cordova,female,43,Stewartshire-IA-12779,2015/11/08
1272,Connor Palmer,male,26,Nicolefurt-WI-72889,2000/11/03
1390,Julia Vaughan,female,65,Franklinshire-GA-30754,2002/07/22
1728,Jennifer Casey,female,70,Theresabury-TN-05417,2013/03/01
1976,William Morgan,male,27,Coxhaven-HI-65271,2009/01/30
1993,Donna Cooper,female,41,Port Levi-MD-67124,2011/12/28
2301,Willie Villegas,male,40,East Drewchester-FL-95044,2007/08/03


## Question 2: Python

Perform analysis with Python

### Question 2.1

Load the original `users_v.csv` into a Pandas dataframe [1]

In [None]:
data = pd.read_csv('https://storage.googleapis.com/bdt-beam/users_v.csv')

### Question 2.2

Display/print the top ten lines of the dataframe [1]

In [None]:
data.head(10)

### Question 2.3

Find the age of the oldest and youngest person in the dataset [1]

Hint: you can use the `print(..., ...)` function to display the two values if you construct it as two arguments

In [None]:
print(min(data.age), max(data.age))
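
The youngest and oldest ages can also be read with the Series methods `min()` and `max()`, or both at once with `agg()`. The sketch below is a minimal illustration on a small hypothetical `ages` Series (the values are taken from the first rows of the file shown earlier), not on the full dataset.

```python
import pandas as pd

# Hypothetical stand-in for the `age` column of users_v.csv.
ages = pd.Series([73, 56, 75, 76, 31], name="age")

# Series methods, passed to print() as two arguments.
youngest, oldest = ages.min(), ages.max()
print(youngest, oldest)

# agg() returns both statistics in one Series labelled "min" and "max".
extremes = ages.agg(["min", "max"])
print(extremes)
```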

### Question 2.4

Draw descriptive statistics (one-liner) for the `age` column - these statistics should include `count`, `mean`, and `std` [1]

Hint: this command has a parallel in R

In [None]:
data.age.describe()
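
To see what such a summary contains, here is a small self-contained sketch on a hypothetical `ages` Series (values copied from the first rows of the file). Calling `describe()` on a single numeric column returns a Series indexed by statistic name, so the three statistics the question names can be picked out with `.loc`.

```python
import pandas as pd

# Hypothetical stand-in for the `age` column of users_v.csv.
ages = pd.Series([73, 56, 75, 76, 31, 21, 41, 34, 41], name="age")

# describe() on a numeric Series yields count, mean, std, min, quartiles, max.
summary = ages.describe()

# Keep only the statistics the question asks for.
print(summary.loc[["count", "mean", "std"]])
```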

### Question 2.5

* Using anonymous functions (`lambda`), create a `surname` column from the name column (you may assume that the last word without a space is the surname) [2]
* Display the last 10 lines of your dataframe [1]
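
One possible approach, sketched on a tiny hypothetical dataframe rather than the real `data`: apply a `lambda` to the `name` column that splits on whitespace and keeps the last word. This relies on the assumption stated in the question that the last word is the surname.

```python
import pandas as pd

# Hypothetical stand-in for the dataframe loaded from users_v.csv.
data = pd.DataFrame({"name": ["Anthony Wolf", "James Armstrong", "Cody Shaw"]})

# Split each full name on whitespace and keep the last word as the surname.
data["surname"] = data["name"].apply(lambda full_name: full_name.split()[-1])

# tail(10) shows the last ten rows (here, all three).
print(data.tail(10))
```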


### Question 2.6

* Convert `date_joined` to a pandas series of type `datetime`  [1]
* Overwrite the original `date_joined` column with the result [1]
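
A minimal sketch of the conversion, assuming the `YYYY/MM/DD` layout seen in the first rows of the file holds throughout. The small dataframe below is a hypothetical stand-in for `data`; `pd.to_datetime` with an explicit `format` parses the strings, and assigning the result back overwrites the column.

```python
import pandas as pd

# Hypothetical stand-in for the dataframe loaded from users_v.csv.
data = pd.DataFrame({"date_joined": ["2019/03/13", "2020/11/06", "2004/05/29"]})

# Parse the strings and overwrite the original column with the result.
data["date_joined"] = pd.to_datetime(data["date_joined"], format="%Y/%m/%d")

# The column now has a datetime dtype, so the .dt accessor is available.
print(data["date_joined"].dtype)
print(data["date_joined"].dt.year.tolist())
```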

Now that it is a datetime, we can plot how many users signed up/registered each year.

In [None]:
import matplotlib

%matplotlib inline

data.user_id.groupby([data.date_joined.dt.year]).count().plot(kind="bar")

## Question 3: Git

Push your notebook to Git. If you don't have any Git tooling installed on your machine, download a preferred tool.

 * Create a repository (named `day1-tutorial`) on GitLab (check your student email for the sign-up/membership request to GitLab) [1]
 * Push this notebook to that repository [1]

## The End