# Put your questions in order starting here

# Pandas Categorical Part 1

Split out column entries
Load the following survey data into a Pandas dataframe called x and note that the top part of the `Is there anything in particular you want to use Python for?` column looks like the following:

| ID | Is there anything in particular you want to use Python for? |
| --- | --- |
| 3931 | Data extraction and processing, Data analytics... |
| 4205 | Data extraction and processing |
| 3669 | Data analytics, Machine learning, Statistical ... |
| 1452 | Data extraction and processing, Data analytics... |
| 2968 | Numerical processing, Data analytics, Machine ... |

The problem with this column is that there are multiple comma-separated values in it. Please write a Python function called split_count that can take this column as input and output the following Pandas dataframe.

| | count |
| --- | --- |
| All of the above | 1 |
| Computer vision | 1 |
| Image Processing | 1 |
| Computer vision/image processing | 1 |
| As a general skill | 1 |
| scripting seems desirable for many jobs | 1 |
| not sure | 1 |
| Computer Vision | 1 |
| EDA tools | 1 |
| Web development | 104 |
| Numerical processing | 173 |
| Scientific visualization | 198 |
| Statistical analysis | 222 |
| Data extraction and processing | 291 |
| Data analytics | 351 |
| Machine learning | 381 |

Here is the function signature: split_count(x) where x is a pd.Series object and it returns a pd.DataFrame object.
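
As a rough illustration of the splitting and counting described above, here is a minimal sketch. It assumes missing responses should be ignored and that rows should be ordered from least to most frequent. The order among tied counts is not controlled, so it may differ from the table above. The `sample` Series is a hypothetical miniature of the survey column, not real data.

```python
import pandas as pd

def split_count(x):
    """Count how often each comma-separated entry appears in the Series x."""
    assert isinstance(x, pd.Series)
    # Split every response on ', ' and give each piece its own row;
    # missing responses are dropped first (an assumption, see the lead-in).
    pieces = x.dropna().str.split(', ').explode()
    # Tally the pieces and order them from least to most frequent.
    counts = pieces.value_counts().sort_values()
    return counts.to_frame(name='count')

# Hypothetical miniature of the survey column
sample = pd.Series(['Data analytics, Machine learning', 'Machine learning', None])
result = split_count(sample)
print(result)
```

Using `explode` avoids building one temporary Series per row, which is what an `apply`-based version would do.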

**Validation Tests** <br>
Check for corner cases and constraints in the inputs. List all cases used for testing.

In [None]:
assert isinstance(x,pd.Series)

**Functional Tests** <br>
Check that the function output matches the expected result. List all cases used for testing.

In [None]:
import pandas as pd
survey_data = pd.read_csv('survey_data.csv') 
x = survey_data.loc[:,'Is there anything in particular you want to use Python for?']

new_s = x.str.split(', ').apply(lambda y: pd.Series(y).value_counts()).sum()                
df = new_s.to_frame()                                                                       
df.columns = ['count']                                                                      
df = df.astype({'count':int})                                                               
assert isinstance(split_count(x),pd.DataFrame) # asserting the output is a dataframe
assert split_count(x).sort_index().equals(df.sort_index()) # assert output of function matches the dataframe generated above, ignoring row order

# Pandas Categorical Part 2

Add a new column using Timestamp column
Using the same survey dataframe from before, create a dataframe column month-yr with ID as row-index like the following,

| ID | month-yr |
| --- | --- |
| 3931 | Sep-2017 |
| 4205 | Sep-2017 |
| ... | ... |
| 2524 | Jan-2019 |

Note that each of the entries is a string. That is, given that your original survey dataframe is x, you should be able to produce the output above from
>>> x['month-yr'] 
Your function add_month_yr(x) should take in the x survey dataframe and then output the same dataframe with a new month-yr column.
Here is the function signature: add_month_yr(x) where x is a pd.DataFrame and returns the same pd.DataFrame with the new column. This means all you have to do is take the input dataframe and add a single column to it.
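
The column can be derived from the Timestamp column in one step. The sketch below assumes the Timestamp entries are strings that `pd.to_datetime` can parse and that month abbreviations are rendered in an English locale. The two sample timestamps are hypothetical, chosen only to produce labels like the ones shown above.

```python
import pandas as pd

def add_month_yr(x):
    """Add a 'month-yr' string column such as 'Sep-2017' to the dataframe x."""
    assert isinstance(x, pd.DataFrame)
    # Parse the timestamps, then format each one as abbreviated month plus year.
    x['month-yr'] = pd.to_datetime(x['Timestamp']).dt.strftime('%b-%Y')
    return x

# Hypothetical two-row stand-in for the survey dataframe
sample = pd.DataFrame(
    {'Timestamp': ['2017-09-06 12:00:00', '2019-01-15 08:30:00']},
    index=pd.Index([3931, 2524], name='ID'),
)
out = add_month_yr(sample)
print(out['month-yr'])
```

Note that this version modifies the input dataframe in place and returns it, which is one reading of "output the same dataframe".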

**Validation Tests** <br>
Check for corner cases and constraints in the inputs. List all cases used for testing.

In [None]:
assert isinstance(x,pd.DataFrame)

**Functional Tests** <br>
Check that the function output matches the expected result. List all cases used for testing.

In [None]:
import pandas as pd
x = pd.read_csv('survey_data.csv') 
expected = x.copy() # build the expected dataframe on a copy, before add_month_yr modifies x
expected['month-yr'] = pd.to_datetime(expected['Timestamp']).dt.strftime('%b-%Y')
result = add_month_yr(x)
assert isinstance(result,pd.DataFrame) # assert output of function is a dataframe
assert result.equals(expected) # assert output of function is x dataframe with month-yr column

# Pandas Categorical Part 3

Write a function count_month_yr to create the following dataframe using your new column month-yr,

| month-yr | Timestamp |
| --- | --- |
| Apr-2018 | 28 |
| Feb-2018 | 2 |
| Jan-2018 | 148 |
| Jan-2019 | 57 |
| Mar-2018 | 41 |
| Oct-2018 | 6 |
| Sep-2017 | 74 |
| Sep-2018 | 130 |

Notice that the order of the dates is incorrect. We will fix that later. Remember to include your add_month_yr code from the previous part, as your new function needs the output from it.

Here is the function signature: count_month_yr(x) where x is a pd.DataFrame that returns a pd.DataFrame.
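
One way to produce such a table is a groupby on the new column followed by a count. The sketch below assumes the input already has the month-yr column, that is, it is the output of add_month_yr. The sample rows are hypothetical.

```python
import pandas as pd

def count_month_yr(x):
    """Count Timestamp entries per 'month-yr' label and return them as a dataframe."""
    assert isinstance(x, pd.DataFrame)
    assert 'month-yr' in x, "run add_month_yr first"
    # With plain string labels, groupby sorts the groups alphabetically,
    # which is why the months come out in the wrong order.
    return x.groupby('month-yr')['Timestamp'].count().to_frame()

# Hypothetical stand-in for the output of add_month_yr
sample = pd.DataFrame({
    'Timestamp': ['2017-09-06', '2018-01-10', '2018-01-11'],
    'month-yr': ['Sep-2017', 'Jan-2018', 'Jan-2018'],
})
counts = count_month_yr(sample)
print(counts)
```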

Please put your Python code in a Python script file and upload it. Please retain your submitted source files! Remember to use all the best practices we discussed in class. You can use any module in the Python standard library, but third-party modules (e.g., Numpy, Pandas) are restricted to those explicitly mentioned in the problem description.

After you have submitted your file, do not use the browser back or reload buttons to navigate or open the page in multiple browser tabs, as this may cause your attempts to decrease unexpectedly. It may take up to thirty seconds for your code to be processed, so please be patient.

Good luck!

**Validation Tests** <br>
Check for corner cases and constraints in the inputs. List all cases used for testing.

In [None]:
import pandas as pd
assert(isinstance(x, pd.DataFrame)) #input x must be a Pandas Dataframe, if some non dataframe input is sent, should raise assertion error

**Functional Tests** <br>
Check that the function output matches the expected result. List all cases used for testing.

In [None]:
df = pd.read_csv('survey_data.csv')
x = (add_month_yr(df))
assert(isinstance(x,pd.DataFrame))     #checking if add_month_yr function returns a Pandas dataframe

In [None]:
y = (count_month_yr(x))
assert(isinstance(y,pd.DataFrame))   #checking if count_month_yr function returns a Pandas dataframe

In [None]:
d = {'month-yr': ['Apr-2018', 'Feb-2018', 'Jan-2018', 'Jan-2019', 'Mar-2018', 'Oct-2018', 'Sep-2017', 'Sep-2018' ], 'Timestamp': [28, 2, 148,57, 41, 6, 74, 130]}
dfn = pd.DataFrame(data=d)
dfn.set_index('month-yr', inplace=True)
dfn = dfn.sort_index()
y = y.sort_index()
assert(dfn.equals(y)) #checks if the dataframe consists of columns month-yr and Timestamp and has the required entries

# Pandas Categorical Part 4

The problem with our previous result is that the order is wrong. Convert the month-yr column dtype to a Pandas CategoricalDtype with the correct order. You should be able to reproduce the following statement,

x.groupby('month-yr')['Timestamp'].count().to_frame().sort_index() 



| month-yr | Timestamp |
| --- | --- |
| Sep-2017 | 74 |
| Jan-2018 | 148 |
| Feb-2018 | 2 |
| Mar-2018 | 41 |
| Apr-2018 | 28 |
| Sep-2018 | 130 |
| Oct-2018 | 6 |
| Jan-2019 | 57 |

Note that the groupby is now sorted correctly. Your function signature is fix_categorical(x). It should take the dataframe containing the month-yr column and return the same dataframe with that column converted to a CategoricalDtype whose category order produces the sorting described above. Remember to include your add_month_yr code from the previous part, as your new function needs the output from it.
Here is your function signature fix_categorical(x) where x is a pd.DataFrame with the required "month-yr" column and output is a pd.DataFrame with the "month-yr" column having the categorical dtype.
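
A possible approach is to build the category order by parsing each distinct label back into a date. The sketch below assumes every label has the `%b-%Y` form produced earlier and that only labels present in the data should become categories. The sample data is hypothetical.

```python
from datetime import datetime
import pandas as pd

def fix_categorical(x):
    """Convert x['month-yr'] to an ordered categorical in chronological order."""
    assert isinstance(x, pd.DataFrame), "Input x must be a Pandas Dataframe"
    assert 'month-yr' in x, "input must include 'month-yr' column"
    # Distinct labels, sorted by the date each one represents.
    labels = x['month-yr'].dropna().unique()
    ordered = sorted(labels, key=lambda s: datetime.strptime(s, '%b-%Y'))
    dtype = pd.CategoricalDtype(categories=ordered, ordered=True)
    x['month-yr'] = x['month-yr'].astype(dtype)
    return x

# Hypothetical stand-in for the survey dataframe
sample = pd.DataFrame({
    'month-yr': ['Sep-2017', 'Aug-2019', 'Apr-2011', 'Aug-2019'],
    'Timestamp': [0, 1, 2, 3],
})
y = fix_categorical(sample)
# observed=True restricts the groups to categories that actually occur
counts = y.groupby('month-yr', observed=True)['Timestamp'].count().to_frame().sort_index()
print(counts)
```

This version changes only the dtype and leaves the row order alone. Sorting, whether through `sort_values` or a groupby, then follows the category order rather than the alphabet.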

**Validation Tests** <br>
Check for corner cases and constraints in the inputs. List all cases used for testing.

In [None]:
import pandas as pd

assert isinstance(x, pd.DataFrame), "Input x must be a Pandas Dataframe"
assert "month-yr" in x, "input must include 'month-yr' column"

**Functional Tests** <br>
Check that the function output matches the expected result. List all cases used for testing.

In [None]:
import pandas as pd
# Test case 1 - ensure format of output is correct
d = {'month-yr': ['Sep-2017', 'Aug-2019'], 'Timestamp': [3, 2]}
df = pd.DataFrame(data=d, index=d['Timestamp'])
df.columns = ['month-yr', 'Timestamp']
index = df.index
index.name = 'Timestamp'
y = fix_categorical(df)
assert isinstance(y, pd.DataFrame), "y must be a pandas DataFrame"
assert 'month-yr' in y, "y must contain the month-yr column"
assert 'Timestamp' in y, "y must contain the Timestamp column (so that we can run the command 'x.groupby('month-yr')['Timestamp'].count().to_frame().sort_index()')"
assert y['month-yr'].dtype.name == 'category', "y['month-yr'] must have a pandas CategoricalDtype"

In [None]:
import pandas as pd
# Test case 2 - test ordering on smaller dataset. This tests that given a list of out-of-order dates (with multiple instances of some dates), 
# fix_categorical puts them in the correct chronological order
d = {'month-yr': ['Sep-2017', 'Aug-2019', 'Apr-2019', 'Apr-2011', 'Apr-2011', 'Aug-2019', 'Aug-2019'], 'Timestamp': [2, 4, 3, 0, 1, 5, 6]}
df = pd.DataFrame(data=d, index=d['Timestamp'])
df.columns = ['month-yr', 'Timestamp']
index = df.index
index.name = 'Timestamp'
y = fix_categorical(df)
assert y['month-yr'].sort_values().tolist() == ['Apr-2011', 'Apr-2011', 'Sep-2017', 'Apr-2019', 'Aug-2019', 'Aug-2019', 'Aug-2019'] # sorting follows the category order, whether or not fix_categorical reorders rows

In [None]:
import pandas as pd
# Test case 3 - test ordering on larger dataset.
d = {'month-yr': ['Sep-2017', 'Aug-2019', 'Oct-2017', 'Jan-2017', 'Mar-2017', 'Feb-2017', 'Jan-2017', 'Jan-2018', 'May-2017', 'Jul-2017', 'Jun-2017', \
    'Apr-2019', 'Apr-2011', 'Apr-2011', 'Aug-2019', 'Dec-2017', 'Aug-2019', 'Sep-2017', 'Nov-2017'], 'Timestamp': [10, 16, 11, 2, 5, 4, 3, 14, 6, 8, 7, \
        15, 0, 1, 17, 13, 18, 9, 12]}
df = pd.DataFrame(data=d, index=d['Timestamp'])
df.columns = ['month-yr', 'Timestamp']
index = df.index
index.name = 'Timestamp'
y = fix_categorical(df)
assert y['month-yr'].sort_values().tolist() == ['Apr-2011', 'Apr-2011', 'Jan-2017', 'Jan-2017', 'Feb-2017', 'Mar-2017', 'May-2017', 'Jun-2017', 'Jul-2017', 'Sep-2017', \
    'Sep-2017', 'Oct-2017', 'Nov-2017', 'Dec-2017', 'Jan-2018', 'Apr-2019', 'Aug-2019', 'Aug-2019', 'Aug-2019']

In [None]:
import pandas as pd
# Test case 4 - check that command 'x.groupby('month-yr')['Timestamp'].count().to_frame().sort_index()' produces expected output
d = {'month-yr': ['Sep-2017', 'Aug-2019', 'Sep-2017', 'Apr-2011', 'Apr-2011', 'Apr-2011', 'Apr-2011', 'Jan-2018', 'Apr-2011', 'Apr-2011', 'Apr-2011', \
    'Apr-2019', 'Apr-2011', 'Apr-2011', 'Aug-2019', 'Sep-2017', 'Aug-2019', 'Apr-2011', 'Sep-2017'], 'Timestamp': [10, 16, 11, 2, 5, 4, 3, 14, 6, 8, 7, \
        15, 0, 1, 17, 13, 18, 9, 12]}
df = pd.DataFrame(data=d, index=d['Timestamp'])
df.columns = ['month-yr', 'Timestamp']
index = df.index
index.name = 'Timestamp'
y = fix_categorical(df)
Counts = y.groupby('month-yr')['Timestamp'].count().to_frame().sort_index() 
assert Counts['Timestamp'].values.tolist() == [10, 4, 1, 1, 3]

# Rational Numbers

Implement a class of rational numbers (ratio of integers) with the following interfaces and behaviours
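
The interface list is not reproduced here, so the sketch below infers the behaviour from the functional tests in this section: reduction to lowest terms, comparison, arithmetic with other rationals and with integers, and conversion to float and int. The attribute names `num` and `den` and the helper `_coerce` are illustrative choices, not part of the assignment.

```python
from functools import total_ordering
from math import gcd

@total_ordering
class Rational:
    """Sketch of a rational number kept in lowest terms with a positive denominator."""

    def __init__(self, numerator, denominator=1):
        assert isinstance(numerator, int), "the numerator must be an integer"
        assert isinstance(denominator, int), "the denominator must be an integer"
        assert denominator != 0, "the denominator must be non-zero"
        if denominator < 0:  # keep the sign on the numerator
            numerator, denominator = -numerator, -denominator
        g = gcd(numerator, denominator)
        self.num, self.den = numerator // g, denominator // g

    @staticmethod
    def _coerce(other):
        """Hypothetical helper: wrap ints as Rational, return None for other types."""
        if isinstance(other, Rational):
            return other
        return Rational(other) if isinstance(other, int) else None

    def __repr__(self):
        return str(self.num) if self.den == 1 else f"{self.num}/{self.den}"

    def __hash__(self):
        return hash((self.num, self.den))

    def __eq__(self, other):
        other = self._coerce(other)
        if other is None:
            return NotImplemented
        return (self.num, self.den) == (other.num, other.den)

    def __lt__(self, other):
        other = self._coerce(other)
        if other is None:
            return NotImplemented
        # Denominators are positive, so cross-multiplying preserves the order.
        return self.num * other.den < other.num * self.den

    def __neg__(self):
        return Rational(-self.num, self.den)

    def __add__(self, other):
        other = self._coerce(other)
        if other is None:
            return NotImplemented
        return Rational(self.num * other.den + other.num * self.den, self.den * other.den)

    __radd__ = __add__

    def __sub__(self, other):
        other = self._coerce(other)
        return NotImplemented if other is None else self + (-other)

    def __rsub__(self, other):
        return (-self) + other

    def __mul__(self, other):
        other = self._coerce(other)
        if other is None:
            return NotImplemented
        return Rational(self.num * other.num, self.den * other.den)

    __rmul__ = __mul__

    def __truediv__(self, other):
        other = self._coerce(other)
        if other is None:
            return NotImplemented
        return Rational(self.num * other.den, self.den * other.num)

    def __rtruediv__(self, other):
        other = self._coerce(other)
        return NotImplemented if other is None else other / self

    def __float__(self):
        return self.num / self.den

    def __int__(self):
        q = abs(self.num) // self.den  # truncate toward zero, like int() on a float
        return q if self.num >= 0 else -q

print(Rational(20, 2), Rational(-1, 5) + Rational(11, 4) * Rational(100, 8) - Rational(2, 8))
```

Reducing in the constructor keeps every instance in a canonical form, so equality can compare numerators and denominators directly.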



**Validation Tests** <br>
Check for corner cases and constraints in the inputs. List all cases used for testing.

In [None]:
# For Rational(numerator, denominator)
assert isinstance(numerator, int), "the numerator must be an integer"
assert isinstance(denominator, int), "the denominator must be an integer"
assert denominator != 0, "the denominator must be non-zero"

**Functional Tests** <br>
Check that the function output matches the expected result. List all cases used for testing.

In [None]:
assert repr(Rational(10,1)) == '10', "check for __repr__ implementation"
assert Rational(20,2) == Rational(10,1), "check for __eq__ implementation and simplification"
assert (sorted([Rational(10,3), Rational(3,10), Rational(5,2), Rational(3,10)]) == [Rational(3,10), Rational(3,10), Rational(5,2), Rational(10,3)]), "check for sorting functionality"
assert Rational(-1,5) + Rational(11,4) * Rational(100,8) - Rational(2,8) == Rational(1357,40), "check for __sub__, __mul__ and __add__ implementations"
assert -Rational(123,2)/7 + 2/Rational(28,5) == Rational(-59, 7), "check for __rtruediv__ and __truediv__ with integer"
assert float(Rational(257,125)) == 2.056, "check for float implementation"
assert int(Rational(10,1)) == 10 , "check for int implementation"

# Rational Square Root


Using your Rational class for representing rational numbers, write a function square_root_rational which takes an input rational number x and returns the square root of x to absolute precision abs_tol. Your function should return a Rational number instance as output. Here is an example,

`square_root_rational(Rational(1112,3),abs_tol=Rational(1,1000)) # output is Rational instance
 10093849/524288`
 


Here is your function signature: square_root_rational(x,abs_tol=Rational(1,1000)).

Hint: Use the bisection algorithm to compute the square root.
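
The bisection idea can be sketched as follows. To keep the example self-contained it uses `fractions.Fraction` from the standard library as a stand-in for your Rational class, and it assumes x is non-negative. The starting interval `[0, max(x, 1)]` is one possible choice, so the exact fraction returned need not match the example output above.

```python
from fractions import Fraction

def square_root_rational(x, abs_tol=Fraction(1, 1000)):
    """Approximate sqrt(x) by bisection, staying in exact rational arithmetic."""
    assert x >= 0, "x must be non-negative"
    assert abs_tol > 0, "abs_tol must be positive"
    # Invariant: lo**2 <= x <= hi**2, so sqrt(x) always lies in [lo, hi].
    lo, hi = Fraction(0), max(x, Fraction(1))
    while hi - lo > abs_tol:
        mid = (lo + hi) / 2
        if mid * mid <= x:
            lo = mid  # the root is in the upper half
        else:
            hi = mid  # the root is in the lower half
    # The interval is now no wider than abs_tol, so lo is within abs_tol of the root.
    return lo

root = square_root_rational(Fraction(1112, 3), abs_tol=Fraction(1, 1000))
print(root, float(root))
```

Each iteration halves the interval, so the number of iterations grows with the logarithm of the starting width divided by abs_tol.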

**Validation Tests** <br>
Check for corner cases and constraints in the inputs. List all cases used for testing.

In [None]:
assert(isinstance(abs_tol, Rational))

**Functional Tests** <br>
Check that the function output matches the expected result. List all cases used for testing.

In [None]:
import random

# set the low, high range of numerators and denominators to test
low = 1
high = 10000

#how many test trials we want to generate
n = 1
nums = [random.randint(low, high) for i in range(n)]
dens = [random.randint(low, high) for i in range(n)]
tols = [random.randint(low, high) for i in range(n)]

# zip 'em and run 'em
# NOTE: Can definitely take a while for large irrational numbers and/or small tolerances
for num, den, tol in zip(nums, dens, tols):
    operand = Rational(num, den)
    abs_tol = Rational(1,tol)
    assert abs(float(square_root_rational(operand, abs_tol)) - (num/den)**(1/2)) < float(abs_tol), f"[Rational Square Root] Test failed for testcase (numerator={num}, denominator={den}, tol=1/{tol})"
