*General hints:* <br>
* You may use another notebook to test different approaches and ideas. When complete and mature, turn your code snippets into the requested functions in this notebook for submission. 
* Make sure the function implementations are generic and can be applied to any dataset (not just the one provided).
* Add explanatory code comments in the code cells. Make sure that these comments improve our understanding of your implementation decisions.

-----
* Create a variable holding your student id, as shown below. 
* Simply replace the example (`01234567`) with your actual student id having a total of 8 digits. 
* Maintain the variable as a string, do NOT change its type in this notebook!
* *Note: If your student id has 7 digits, add a leading 0. The final student id MUST have 8 digits!*

In [2]:
mn = '01453741'

## 0. Import

Implement a function `tidy` which imports the data set assigned and provided to you as a CSV file into a `pandas` dataframe. Access the data set and establish whether your data set is tidy. If not, clean the data set before continuing with Step 1. Mind all rules of tidying data sets in this step. Make sure you comply with the following statements:
* If there is an index column in your dataset, keep it.
* It is ok that one person has multiple jobs. You do not need to split those.
* The dataset should have a total of 8 columns (not including the index), the first column should be `full_name`.
* Mind the intended content of each attribute (e.g., `full_name` should contain the full name of a person, no need to change that)
* Have the function `tidy` return the ready data set as a dataframe.

Note that `tidy` must take a single parameter that holds the basename of the CSV file (i.e., the name without file extension). Do NOT change the name of the file, do not overwrite the original data file, and make sure you submit your final ZIP following the [Code of Conduct](https://datascience.ai.wu.ac.at/ws20/dataprocessing1/code_of_conduct.html) requirements. Especially, make sure you put your data file in a folder called `data/` when submitting.

In [3]:
import pandas as pd
import numpy as np
import os
import re
from decimal import Decimal # used for cleaning coordinates column, to convert them to float numbers which is much more convenient
from IPython.display import display # used for displaying dataframes like table in Jupyter env.

miss_chars = ['NA', '-inf', 'inf', 'nan', None, 'None','', ' ', 0, 'NaN'] # possible missing values



def tidy(x):
    """Tidying-up the dataframe"""
    x = str(x) + '.csv'
    path = os.path.join('data', x) # system-independent path, so it won't fail across operating systems
    df = pd.read_csv(path, sep=',', na_values = miss_chars)
    df = df.T # transposing the df
    df = df.rename(columns={0: 'full_name', 1:'automotive', 2:'color', 3:'job', 4:'address', 5:'coordinates', 6:'company_name'}) # adding header
    
    
    def find_comp(comp):
        """Extracts company name from messed-up column"""
        result = re.sub(r'[^a-zA-Z \-,]+', '', comp) # keep only letters, spaces, hyphens and commas (this drops the digits of the date)
        result = result.lstrip('-').replace('-', ' and ') # strip the leading '-' left over from the date; a '-' between names becomes ' and ', which matches how the other company names are written
        return result
    
    def check_date(date):
        """Extracts Date from messed-up column"""
        result = re.findall(r'((?:19|20)\d\d)-(0?[1-9]|1[012])-([12][0-9]|3[01]|0?[1-9])', date) # extracts dates shaped like YYYY-MM-DD (year 19xx/20xx, month 1-12, day 1-31)
        if len(result) == 1:
            return '-'.join(result[0]) # joining the (year, month, day) tuple back into one string
        else:
            return 'NaN' # no date or more than one date found; the 'NaN' string is replaced by np.nan at the end of tidy
    
    def clean_coords(cords):
        """Extracts only numbers from Decimal() in coordinates column"""
        if isinstance(cords, str):
            res = eval(cords) # converting string to tuple; relies on the Decimal import above and should only be used on trusted data
            if isinstance(res, tuple):
                return float(res[0]), float(res[1]) # returning new tuple that contains only numbers
            else:
                return np.nan # returns NaN if the coordinates value fails to meet the conditions for reformatting
        else:
            return cords
        
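    def clean_address(add): # removing values with only numbers from address column since these are irrelevant
        """If only numbers are present, it will replace it with NaN"""
        try:
            int(add) # succeeds only for purely numeric values
            return np.nan
        except (TypeError, ValueError, OverflowError): # not a plain number (or already missing), so keep the value
            return add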

        
    def clean_color(color):
        """Splits color names and makes them lowercase so they look more natural"""
        if isinstance(color, str):
            result = re.findall('[A-Z][^A-Z]*', color)
            return ' '.join(result).lower() # returns lowercase, space-separated color names
        else:
            return np.nan

        
    # applying all of the above functions to the dataframe
    df['date'] = df['company_name'].apply(check_date) # making new column with extracted dates
    df['company_name'] = df['company_name'].apply(find_comp) # cleaning column so it only keeps company names
    df['coordinates'] = df['coordinates'].apply(clean_coords) # cleaning coordinates column
    df['address'] = df['address'].apply(clean_address) # cleaning address column
    df['color'] = df['color'].apply(clean_color) # cleaning color column
    
    df.replace(miss_chars, np.nan, inplace=True) # replacing all possible 'missing characters' with np.nan so they can be detected with DataFrame.isna() or DataFrame.isnull() (really important for the next two parts)
    
    return df

display(tidy(mn)) # displaying the dataframe in IPython manner

Unnamed: 0,full_name,automotive,color,job,address,coordinates,company_name,date
0,Rachel Santos,670 SJV,white smoke,"Horticulturist, commercial",Lake Dustin,"(29.2588815, 68.330046)",Pham Inc,2020-09-05
1,Brandi Castillo,89-22271,medium aqua marine,Ambulance person,Blanchardtown,"(-57.8942475, -126.431194)",Ashley and Michael,2020-07-09
2,Paula Dodson,L62 8UK,sky blue,Pension scheme manager,Port Karenfurt,"(50.146492, -157.040417)","Castro, Meyer and Smith",2020-05-22
3,Barbara Garner,804LCR,dark slate gray,Data scientist,,"(16.274987, -59.948235)",Hughes and Bowman,2020-06-10
4,Teresa Allen,9JW04,dark salmon,Industrial/product designer,Meganland,"(-8.875462, -57.949751)",Bennett and Wilson,2020-10-19
...,...,...,...,...,...,...,...,...
1706,Lisa Williams,1-C4003,lavender,Jewellery designer,Sawyerstad,"(15.909499, -175.838439)",Wood Group,2020-04-06
1707,Julie Schaefer,GJ3 5912,cornflower blue,"Research officer, government",Jessicaville,"(-83.589762, -52.721594)",Bryan LLC,2020-09-19
1708,Michelle Lane,406TIE,crimson,Manufacturing engineer,Donaldsonside,"(-78.7827135, 147.130558)",Lyons and Perez,2020-04-08
1709,Cheryl Rowe,JVX G25,ghost white,"Teacher, special educational needs",Gateshaven,"(42.5015945, -11.491344)",Bowers and Miller,2020-04-28


In [4]:
from nose.tools import assert_equal
import pandas
assert_equal(type(tidy(mn)), pandas.core.frame.DataFrame)
assert_equal(len((tidy(mn)).columns), 8)
assert_equal(list((tidy(mn)).columns)[0], "full_name")


-------
## 1. Missing values

### 1.1 Code part
Implement a function called `missing_values` which takes as an input a dataframe and checks whether there are any missing values in the dataset. Record the row ids of the observations containing missing values as a list of numbers and make sure that the function returns the recorded list in the end. If there are no missing values, `missing_values` should return an empty list. Mind the following:

* Missing values may be encoded using any or multiple of the following special-purpose values: `NA`, `-inf`, `inf`, `nan`, `None`, ` `, `0`, or the empty string.
* There will be at least one, but likely many more missing values.

In [14]:
def missing_values(df): 
    miss_chars = ['NA', '-inf', 'inf', 'nan', None, 'None','', ' ', 0, 'NaN']
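    df.replace(miss_chars, np.nan, inplace=True) # normalise every missing-value marker to NaN (note: this modifies the dataframe passed in)
    list_of_miss_idx = []
    for index, row in df.iterrows():
        if row.isna().any(): # at least one missing value in this row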
            list_of_miss_idx.append(index)
    return list_of_miss_idx


print(missing_values(tidy(mn)))

['3', '12', '27', '41', '57', '109', '119', '131', '140', '158', '167', '182', '224', '227', '268', '273', '287', '308', '331', '337', '366', '383', '385', '387', '409', '413', '433', '457', '487', '494', '507', '512', '529', '532', '556', '608', '628', '645', '651', '685', '703', '750', '766', '806', '808', '816', '840', '853', '883', '939', '953', '959', '976', '1016', '1034', '1051', '1089', '1105', '1141', '1172', '1176', '1180', '1254', '1276', '1288', '1297', '1321', '1330', '1339', '1354', '1362', '1371', '1378', '1441', '1463', '1484', '1520', '1558', '1580', '1584', '1621', '1641', '1674', '1691', '1697']


In [6]:
from nose.tools import assert_equal
from nose.tools import assert_true
assert_equal(type(missing_values(tidy(mn))), list)
assert_true(len(missing_values(tidy(mn))) > 0)


### 1.2. Analytical part

* Does the dataset contain missing values?
* If no, explain how you proved that this is actually the case.
* If yes, describe the discovered missing values. What could be an explanation for their missingness?

Write your answer in the markdown cell below. Do NOT delete or replace the answer cell with another one!


Running `print(len(missing_values(tidy(mn))))` shows how many rows contain at least one missing value; in my case it is 85. `print(df.isna().sum().sum())` counts the missing cells directly and also gives 85 here, so each affected row has exactly one missing value.

There are missing values in every column. I collected all indexes and ran  
`for row_idx in missing:
     display(df.loc[[str(row_idx)]])` to inspect them manually. 
  
Running the following code on the dataframe shows how many missing values there are per column.  
`for column in df.columns:
    result = df[column].isnull().sum()
    print(column, result)` The same can be done with `df.isna().sum()`, which is much simpler and gives cleaner output:  

full_name        9  
automotive       5  
color           11  
job              8  
address         20  
coordinates     12  
company_name    10  
date            10  
dtype: int64  

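The share of missing values can be computed with `print(round(df.isna().sum().sum() / df.size * 100, 2), '%')`. Assuming the 1710 rows shown in the output above, 85 missing cells out of 1710 × 8 = 13680 cells is about 0.62%, so well under 1% of all values are missing in my dataset. Note that `df.isna().count().sum() / df.count().sum()` is not suitable here: it divides the total number of cells by the number of non-missing cells and therefore yields a ratio of about 1.01 rather than a percentage. The same can be done for each column, but I think there is no need for that right now.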
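
As a small illustration of how the share of missing cells can be computed, here is a sketch on a made-up toy dataframe (the names and values are hypothetical and not taken from the assignment data):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "full_name": ["Ann Lee", np.nan, "Bo Chan", "Cy Diaz"],
        "color": ["red", "blue", np.nan, "teal"],
    }
)

missing_cells = int(toy.isna().sum().sum())  # number of NaN cells
total_cells = toy.size                       # rows * columns
missing_share = 100 * missing_cells / total_cells

print(f"{missing_cells} of {total_cells} cells missing ({missing_share:.2f}%)")

# Per-column share: the mean of a boolean mask is the fraction of True values.
print((toy.isna().mean() * 100).round(2))
```

`DataFrame.size` counts every cell, missing or not, whereas `DataFrame.count()` only counts the non-missing ones, so the two should not be mixed up when building the denominator.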

There are many possible reasons why values are missing in a dataset. In this case I would say that they were either not entered correctly into the database or that some of the data was generated automatically.

------
## 2. Handling missing values
### 2.1. Code part
Implement a function called *handling_missing_values* for handling all types of missing values and all their occurrences in our data set. For each missing-value type, and the corresponding variable, choose an appropriate strategy. Make use of the techniques learned in Unit 4. Do NOT simply drop the missing values. Do NOT apply a single technique only. The function should take as an input a dataframe holding missings and return the updated dataframe w/o missings.

In [7]:
def handling_missing_values(x):
    
    # since I know from the previous step that I have missing values in each column, I have to deal with each of them separately
    
    x['full_name'] = x['full_name'].ffill() # LOCF; I would say that the full_name column doesn't have much impact
#     x.dropna(subset=['automotive'], inplace=True) # I consider automotive as something that should be unique, so I would drop rows without a value in the automotive column, but the test doesn't allow dropping. :(
    x['automotive'] = x['automotive'].fillna('unknown') # placeholder value; 0 cannot be used here because missing_values treats 0 as a missing marker
    x['color'] = x['color'].fillna(x['color'].value_counts().index[0]) # replace color with the most common one, since I consider it not very important
    
    # For all other columns I just use imputation with LOCF and one bfill
    x['job'] = x['job'].ffill()
    x['date'] = x['date'].ffill()
    x['company_name'] = x['company_name'].ffill()
    x['address'] = x['address'].ffill()
    x['coordinates'] = x['coordinates'].bfill() # uses the next valid observation to fill the gap
    
    return x

In [8]:
from nose.tools import assert_equal
assert_equal(len(missing_values(handling_missing_values(tidy(mn)))), 0)
assert_equal(handling_missing_values(tidy(mn)).shape, tidy(mn).shape)

### 2.2. Analytical part
Discuss the implications. What are the benefits and disadvantages of the adopted strategies?

Write your answer in the markdown cell below. Do NOT delete or replace the answer cell with another one!

I implemented basic techniques: mode imputation and LOCF (plus one backward fill). Since most of my columns are categorical, I wasn't really able to apply any other techniques.  
With LOCF I might have created some duplicates, which is a potential disadvantage of this strategy.  
Since only a very small share (on the order of 1%) of all values is missing, I think this approach is fair enough and gets the job done.   
I also think that choosing the right strategy depends on the task that the dataset is meant for. 
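
To make the trade-off concrete, the following sketch applies LOCF and mode imputation to a made-up toy dataframe (hypothetical values, not the assignment data) and shows how a forward fill can produce rows that look like duplicates:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "full_name": ["Ann Lee", np.nan, "Bo Chan", "Cy Diaz"],
        "color": ["red", "blue", np.nan, "blue"],
    }
)

filled = toy.copy()
# LOCF: copy the last observed value downwards into the gap.
filled["full_name"] = filled["full_name"].ffill()
# Mode imputation: use the most frequent observed value of the column.
filled["color"] = filled["color"].fillna(filled["color"].mode().iloc[0])

print(filled)

# Rows that now share a name although only one of them was actually observed.
same_name = filled[filled.duplicated(subset=["full_name"], keep=False)]
print(same_name)
```

In this toy case the second row inherits the name of the first one, so a later duplicate check on `full_name` would flag both rows even though they describe different observations.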

I wasn't in good health for the last 2 weeks, so I wasn't able to dive deeper into choosing the right strategies, but I will definitely do that. Apologies.


-----
## 3. Duplicate entries
Implement a function called `duplicates` that takes as an input a (tidy) dataframe `x`. Assume that `duplicates` receives a dataframe as returned from your Step 0 implementation of `tidy`. It then checks whether there are any duplicates in the dataset. Record the row ids of the observations being duplicates and have `duplicates` return the list in the end. An empty list indicates the absence of duplicated observations.

In [9]:
def duplicates(x):
    df = x[x.duplicated(subset=['automotive', 'full_name'], keep='last')] # rows whose (automotive, full_name) pair occurs again later in the data
    return df.index.tolist() # returning list of row indexes of duplicates

print(duplicates(tidy(mn)))

['15', '23', '50', '55', '69', '81', '104', '114', '116', '142', '146', '156', '170', '200', '243', '247', '250', '257', '261', '284', '302', '305', '319', '340', '362', '372', '379', '428', '448', '464', '498', '503', '515', '537', '564', '570', '574', '598', '634', '639', '653', '697', '714', '717', '730', '738', '769', '776', '798', '823', '827', '834', '838', '842', '880', '926', '929', '941', '945', '947', '963', '966', '985', '987', '1001', '1009', '1019', '1022', '1032', '1036', '1039', '1056', '1065', '1075', '1083', '1099', '1101', '1118', '1131', '1147', '1231', '1238', '1246', '1265', '1300', '1313', '1332', '1343', '1357', '1367', '1383', '1388', '1391', '1405', '1413', '1425', '1431', '1434', '1446', '1456', '1459', '1475', '1522', '1537', '1554', '1572', '1576', '1595', '1607', '1630', '1633']


In [15]:
from nose.tools import assert_equal
assert_equal(type(duplicates(tidy(mn))), list)


-----
## 4. Handling duplicate entries
### 4.1. Code part
Implement a function called `handling_duplicate_entries` for handling duplicate entries. Again, the function is assumed to receive a tidied data set as obtained from Step 0. It deduplicates the tidy data set. The function then returns the dataframe without duplicates.

In [11]:
def handling_duplicate_entries(x):
    x.drop_duplicates(subset=['automotive', 'full_name'], keep='last', inplace=True)
    return x

In [12]:
from nose.tools import assert_equal
assert_equal(len(duplicates(handling_duplicate_entries(tidy(mn)))), 0)

### 4.2. Analytical part
Discuss the implications. What are the benefits and disadvantages of the chosen deduplication strategy?

Write your answer in the markdown cell below. Do NOT delete or replace the answer cell with another one!

I consider the combination of `full_name` and `automotive` to be a unique key, so I find all rows that share the same values in these two columns and keep only the last occurrence, because I assume that the last one is the updated one. Rows that contain the same values in all columns are a special case of this and are reduced to a single occurrence by the same step. The advantage of this strategy is that previously entered data that might have been changed is removed, so only the updated data remains. When it comes to rows that are completely identical to other rows in this dataset, I would say that one occurrence is enough. In some other datasets, rows might be the same and still not be duplicates. I don't know an example yet, but the `keep=False` option of `drop_duplicates` (which drops every occurrence of a duplicated row) suggests that this is possible.
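
The effect of the `keep` parameter can be seen in a short sketch. The toy dataframe and its values are made up for illustration; only the key columns `full_name` and `automotive` follow the notebook.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "full_name": ["Ann Lee", "Ann Lee", "Bo Chan"],
        "automotive": ["AB 123", "AB 123", "CD 456"],
        "address": ["Old Town", "New Town", "Meganland"],
    }
)
key = ["full_name", "automotive"]

kept_last = toy.drop_duplicates(subset=key, keep="last")    # newest row per key survives
kept_first = toy.drop_duplicates(subset=key, keep="first")  # oldest row per key survives
kept_none = toy.drop_duplicates(subset=key, keep=False)     # every duplicated key is removed

print(kept_last["address"].tolist())
print(kept_first["address"].tolist())
print(kept_none["address"].tolist())
```

With `keep="last"` the later address wins, which matches the assumption that the last entry is the updated one. If that assumption does not hold for a dataset, `keep="first"` or an explicit sort before deduplication may be the better choice.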