# Lab - Object Oriented Programming

# Challenge 2

In order to understand the benefits of simple object-oriented programming, we have to build up our classes from the beginning. 

In [1]:
import pandas as pd

In [2]:
import numpy as np

In [3]:
chars = ['a', 'b', 'c','d', 'e', 'f', ' ', 'á','é','ó']

def create_weird_dataframe(size=10):
    def create_weird_colnames(size=size):
        probs = [.2,.2,.15,.1,.1,.1,.05,.05,.025,.025]

        return [''.join(
            [(char.upper() if np.random.random() < 0.2 else char) 
                     for char in np.random.choice(chars,size=12, p=probs)]) for i in range(size)]
    
    data = np.random.random(size=(size,size))
    colnames = create_weird_colnames(size)
    return pd.DataFrame(data=data, columns=colnames)

In [4]:
df = create_weird_dataframe()

In [5]:
df

| | aeCfa Áddeód | CáadbeaFdbéc | dcbdbeécdaCf | eéaecbbaóa | cabecáacecAa | daceacabDaaB | ebdbbbFaaace | eác daaAfbEe | acbbbb cÁdca | aCca eáecafe |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.199187 | 0.398402 | 0.975118 | 0.846886 | 0.98759 | 0.347504 | 0.937634 | 0.222918 | 0.735547 | 0.438256 |
| 1 | 0.453212 | 0.224651 | 0.519387 | 0.966593 | 0.292769 | 0.242348 | 0.305621 | 0.03169 | 0.081942 | 0.425011 |
| 2 | 0.965383 | 0.091156 | 0.265362 | 0.079468 | 0.613876 | 0.575571 | 0.702044 | 0.518878 | 0.667119 | 0.418099 |
| 3 | 0.672424 | 0.672365 | 0.234631 | 0.723158 | 0.877574 | 0.638436 | 0.133321 | 0.313445 | 0.873641 | 0.57343 |
| 4 | 0.481717 | 0.145632 | 0.348596 | 0.209323 | 0.253371 | 0.620669 | 0.719409 | 0.027697 | 0.419593 | 0.741562 |
| 5 | 0.959883 | 0.038726 | 0.880266 | 0.886011 | 0.190402 | 0.875896 | 0.217855 | 0.047598 | 0.950035 | 0.594544 |
| 6 | 0.292821 | 0.736027 | 0.612898 | 0.428007 | 0.06254 | 0.210433 | 0.533596 | 0.304629 | 0.889518 | 0.220318 |
| 7 | 0.684929 | 0.031012 | 0.292201 | 0.36437 | 0.773122 | 0.516914 | 0.041169 | 0.444118 | 0.115684 | 0.8264 |
| 8 | 0.617526 | 0.566558 | 0.057939 | 0.854259 | 0.327145 | 0.810291 | 0.279184 | 0.184537 | 0.633631 | 0.225516 |
| 9 | 0.608918 | 0.635304 | 0.064255 | 0.204148 | 0.518271 | 0.543936 | 0.226176 | 0.66469 | 0.901027 | 0.416823 |


## Correcting the column names

### Let's start simple: get the column names of the dataframe.

Store them in a variable called `col_names`.


In [6]:
col_names = df.columns
col_names

Index(['aeCfa Áddeód', 'CáadbeaFdbéc', 'dcbdbeécdaCf', 'eéaecbbaóa  ',
       'cabecáacecAa', 'daceacabDaaB', 'ebdbbbFaaace', 'eác daaAfbEe',
       'acbbbb cÁdca', 'aCca eáecafe'],
      dtype='object')

### Let's iterate through these columns and transform them into lower-case column names

Use a list comprehension to do that if possible. Store the result in a variable called `lower_colnames`.

In [7]:
lower_colnames = [col.lower() for col in col_names]

In [8]:
lower_colnames

['aecfa áddeód',
 'cáadbeafdbéc',
 'dcbdbeécdacf',
 'eéaecbbaóa  ',
 'cabecáacecaa',
 'daceacabdaab',
 'ebdbbbfaaace',
 'eác daaafbee',
 'acbbbb cádca',
 'acca eáecafe']

### Let's remove the spaces from these column names!

Replace each space ` ` in the column names with an underline `_`. Again, try to use a list comprehension.
For this first task, use the `.replace(' ','_')` string method.

In [9]:
[col.replace(' ','_') for col in lower_colnames]

['aecfa_áddeód',
 'cáadbeafdbéc',
 'dcbdbeécdacf',
 'eéaecbbaóa__',
 'cabecáacecaa',
 'daceacabdaab',
 'ebdbbbfaaace',
 'eác_daaafbee',
 'acbbbb_cádca',
 'acca_eáecafe']

### Create a function that groups the steps above and returns the lower-case, underlined names as a list

Name the function `normalize_cols`. This function should receive a dataframe, get its column names and return the treated list of column names.

In [10]:
def normalize_cols(dataframe):
    """
    Receive a dataframe, get its columns, put it in lower 
    case and then replace spaces by underlines.
    """
    colnames = dataframe.columns
    lower_colnames = [col.lower() for col in colnames]

    return [col.replace(' ','_') for col in lower_colnames]

### Test your results

Use the following line of code to test your results. Run it several times to see some behaviors.

In [11]:
normalize_cols(create_weird_dataframe())

['dabbácc_ccóa',
 'dc_db_fdabfe',
 'bebaebacfbfb',
 'ebed_cccfcáa',
 'ab_bdábacbde',
 'ádeaabcabcbe',
 'ccaabcaábfbc',
 'badacaeabább',
 'aaaafbafcecc',
 '_éddcadffccó']

### Hmmm, we made a mistake!

We've made several mistakes by doing this. Have you observed any bugs in our results?

In order for us to see some problems in our results, we have to look for edge cases. 

For example: 

**Problem #1:** what if there are 2 or more consecutive spaces? Do we want to replace them with several underlines or condense them into one?

**Problem #2:** what if there are spaces at the beginning or at the end? Should we replace them with underlines or drop them?

Let's correct each problem, starting with problem 2.
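
Both edge cases are easy to reproduce on plain strings, without a dataframe. The sketch below applies the same `.lower()` and `.replace(' ', '_')` steps used so far; the sample names are made up for illustration and are not taken from the lab data.

```python
# Hypothetical column names, chosen to trigger each edge case.
samples = ['two  spaces', ' leading', 'trailing ']

# Same steps as the first version of normalize_cols.
naive = [name.lower().replace(' ', '_') for name in samples]

print(naive)  # ['two__spaces', '_leading', 'trailing_']
```

The first name shows Problem #1 (a doubled underline), and the other two show Problem #2 (an underline at the start or at the end).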

## Correcting our function

Instead of substituting the spaces right away, let's first remove the leading and trailing spaces!

Recreate the `normalize_cols` with the solution to `Problem 2`.

*Hint: Copy and paste the last `normalize_cols` function to change it.*

In [12]:
def normalize_cols(dataframe):
    """
    Receive a dataframe, get its columns, put it in lower 
    case, strip leading and trailing spaces and then replace 
    the remaining spaces by underlines.
    """
    colnames = dataframe.columns
    lower_colnames = [col.lower().strip() for col in colnames]

    return [col.replace(' ','_') for col in lower_colnames]

### Test your results again.

For now, at least, you should not see any leading or trailing underlines.

In [13]:
normalize_cols(create_weird_dataframe())

['eafdae_éeób',
 'aáaóebbdbcda',
 'ócédaedbeeee',
 'éabbbbáa_bbá',
 'abbafbfbcd_c',
 'adfab_aáfácd',
 'cddcbc_bafb',
 'efacdadbáed',
 'badacfaedbób',
 'cfecfaé_bbbb']

### Correcting problem 1

To correct problem 1, instead of using the `.replace()` string method, we want to use a regular expression. Use the `re` module to substitute the pattern `1 or more spaces` with a single underline `_`.

Test your solution on the variable below:

In [14]:
import re 

text = 'these spaces      should all be one underline'

In [15]:
re.sub(r'\s+', '_', text)

'these_spaces_should_all_be_one_underline'
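
One detail worth knowing: `\s` matches any whitespace character (tabs and newlines included), while a literal space followed by `+` matches only runs of the space character. The sketch below compares the two on a made-up string. It also writes the patterns as raw strings (`r'...'`), which keeps Python from treating `\s` as a string escape sequence.

```python
import re

text = 'a  b\tc'  # two spaces, then a tab

# \s+ collapses any run of whitespace (spaces, tabs, newlines).
any_whitespace = re.sub(r'\s+', '_', text)

# ' +' collapses runs of the space character only; the tab is kept.
spaces_only = re.sub(r' +', '_', text)

print(repr(any_whitespace))  # 'a_b_c'
print(repr(spaces_only))     # 'a_b\tc'
```

For column names, either pattern may be appropriate; which one to use depends on whether tabs should also become underlines.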

### Now correct your `normalize_cols` function

*Hint: Copy and paste the last `normalize_cols` function to change it.*

In [16]:
def normalize_cols(dataframe):
    """
    Receive a dataframe, get its columns, put them in lower 
    case and strip leading and trailing spaces. Each remaining
    run of whitespace is then replaced by a single underline.
    """
    colnames = dataframe.columns
    lower_colnames = [col.lower().strip() for col in colnames]

    return [re.sub(r'\s+', '_', col) for col in lower_colnames]

### Again, test your results.

Now some column names will sometimes be shorter, because consecutive spaces are collapsed into one underline.

In [17]:
normalize_cols(create_weird_dataframe())

['b_eeé_fcbbdb',
 'béaábabbaedc',
 'acbbbááaáeae',
 'eaacdfbbadb',
 'caccadcaóéaa',
 'aeaá_ácdae',
 'aabebbedbáác',
 'áfdcbácfadcf',
 'ebffbaáefeda',
 'áfdófbcbcbáf']

## Last step: remove accents

The last step consists of removing accents from the strings.

Import the `unidecode` function from the package of the same name and use it to remove accents. Test it on the word below.

In [18]:
from unidecode import unidecode

In [19]:
text = 'aéóúaorowó'

In [20]:
unidecode(text)

'aeouaorowo'
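
If installing `unidecode` is not an option, the standard library's `unicodedata` module can handle accented Latin letters like these. The sketch below is an alternative, not the approach this lab uses: it decomposes each character and drops the combining marks. The helper name `strip_accents` is made up for illustration, and the result is not equivalent to `unidecode` in general, since `unidecode` also transliterates characters that have no decomposition.

```python
import unicodedata

def strip_accents(text):
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents('aéóúaorowó'))  # aeouaorowo
```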

### Now remove the accents from each column name in your `normalize_cols` function.

*Hint: Copy and paste the last `normalize_cols` function to change it.*

In [21]:
def normalize_cols(dataframe):
    """
    Receive a dataframe, get its columns, put them in lower 
    case and strip leading and trailing spaces. Each remaining
    run of whitespace is then replaced by a single underline,
    and accents are removed.
    """
    colnames = dataframe.columns
    lower_colnames = [col.lower().strip() for col in colnames]

    return [unidecode(re.sub(r'\s+', '_', col)) for col in lower_colnames]

### Test your results

In [22]:
normalize_cols(create_weird_dataframe())

['aabddcfdfebb',
 'a_accceaedca',
 'b_daaadaafdc',
 'cceebcbdfbcc',
 'eaaabaoe_cfd',
 'aacbeobcaafd',
 'cacaefdacbf',
 'afaaadfbbfoo',
 'coaafeadefeb',
 'odaaaaaebebb']

## Good job. 

You now have a function that receives a dataframe and returns its column names in a clean, consistent format.

# Creating our own dataframe.

In [23]:
from pandas import DataFrame

A dataframe is just a simple class. It contains its own attributes and methods. 

When you call `pd.DataFrame()`, you are just instantiating the DataFrame class as an object that you can store in a variable. From this point onwards, you have access to all DataFrame attributes (`.columns`, for example) and methods (`.isna()`, for example). We've been using those all along!

If we wish, we can create our own class that inherits everything from the DataFrame class.
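
As a small illustration, the sketch below builds a dataframe from made-up values and then touches one attribute and one method of the resulting object.

```python
import pandas as pd

# Instantiating the class produces an object we can store in a variable.
df = pd.DataFrame({'a': [1.0, None], 'b': [3.0, 4.0]})

print(isinstance(df, pd.DataFrame))  # True
print(list(df.columns))              # attribute: ['a', 'b']
print(int(df.isna().sum().sum()))    # method: 1 missing value in total
```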

In [24]:
class myDataFrame(DataFrame):
    pass

Instead of just creating `myDataFrame`, put your function inside your new inherited class. That is, turn `normalize_cols` into a method of your own DataFrame.

Remember that you'll have to give `self` as the first argument of `normalize_cols`, so you can replace everything you once called `dataframe` inside `normalize_cols` with `self`. 

At the end, return the list of corrected names.
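
To see why `dataframe` can simply be renamed to `self`, it may help to look at a tiny class unrelated to pandas. In this sketch, `Greeter` is a made-up class: calling a method on an object is the same as calling the function on the class and passing the object explicitly as the first argument.

```python
class Greeter:
    def __init__(self, name):
        self.name = name

    def greet(self):
        # `self` is the object the method was called on.
        return 'hello ' + self.name

g = Greeter('ana')

# These two calls are equivalent: Python passes `g` as `self`.
print(g.greet())         # hello ana
print(Greeter.greet(g))  # hello ana
```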

In [25]:
class myDataFrame(DataFrame):

    def normalize_cols(self):
        """
        Get the columns of this dataframe, put them in lower 
        case and strip leading and trailing spaces. Each remaining
        run of whitespace is then replaced by a single underline,
        and accents are removed. Return the names as a list.
        """
        colnames = self.columns
        lower_colnames = [col.lower().strip() for col in colnames]

        return [unidecode(re.sub(r'\s+', '_', col)) for col in lower_colnames]

In [26]:
df = myDataFrame(create_weird_dataframe())

In [27]:
df.normalize_cols()

['de_bfcoabdbb',
 'abcoafeadbfe',
 'acbcebdcbdba',
 'fbbbabebbbbf',
 'dfafbaabobba',
 'fbaab_bebfbe',
 'ffbbaeacdaac',
 'cecaafdbdcae',
 'bcebaabaddac',
 'bdacbaeabade']

## Understanding the `self` argument even better

Instead of returning a list containing the corrected names, you should now assign them to `self.columns`. This effectively replaces the column names of your object with the corrected ones.


Now change your method to return the dataframe itself. That is, return the `self` argument this time and see the results! 

```python
class myDataFrame(DataFrame):
    def normalize_cols(self):
        ...
        return self
```
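
One practical benefit of returning `self` is method chaining. The sketch below uses a made-up class `Names`, a stand-in that holds only a list of column names, to show how each call hands the same object on to the next one.

```python
class Names:
    """Hypothetical stand-in for a dataframe holding only column names."""

    def __init__(self, columns):
        self.columns = list(columns)

    def lower(self):
        self.columns = [c.lower() for c in self.columns]
        return self  # returning self allows chaining

    def strip(self):
        self.columns = [c.strip() for c in self.columns]
        return self

names = Names([' Ab', 'cD '])
result = names.lower().strip()

print(result.columns)   # ['ab', 'cd']
print(result is names)  # True: the same object was modified in place
```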

In [28]:
class myDataFrame(DataFrame):

    def normalize_cols(self):
        """
        Get the columns of this dataframe, put them in lower 
        case and strip leading and trailing spaces. Each remaining
        run of whitespace is then replaced by a single underline,
        and accents are removed. The cleaned names are assigned
        back to the dataframe, which is then returned.
        """
        colnames = self.columns
        lower_colnames = [col.lower().strip() for col in colnames]

        self.columns = [unidecode(re.sub(r'\s+', '_', col)) for col in lower_colnames]

        return self

In [29]:
df = myDataFrame(create_weird_dataframe())

In [30]:
df.normalize_cols()

| | aadcdeca_abf | baaaaabbafbc | ecaeobfbabdb | faaeafcedaeo | abaafefeafef | dbdbedcedfbb | fcbbcbccbbfa | eeadcca_dcbe | abdbabaffcfa | ocfc_cdaec_e |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.992265 | 0.013985 | 0.306525 | 0.145396 | 0.699617 | 0.208849 | 0.952534 | 0.853373 | 0.737597 | 0.646021 |
| 1 | 0.859571 | 0.525799 | 0.555252 | 0.598558 | 0.990336 | 0.935121 | 0.935251 | 0.610053 | 0.326483 | 0.089921 |
| 2 | 0.967077 | 0.998587 | 0.924792 | 0.535307 | 0.041123 | 0.522431 | 0.838041 | 0.81377 | 0.949964 | 0.428556 |
| 3 | 0.35192 | 0.829183 | 0.340245 | 0.143752 | 0.928423 | 0.066436 | 0.921498 | 0.183094 | 0.120287 | 0.946927 |
| 4 | 0.890208 | 0.299924 | 0.840376 | 0.832451 | 0.9292 | 0.12991 | 0.252889 | 0.897594 | 0.376965 | 0.898157 |
| 5 | 0.317338 | 0.202602 | 0.280636 | 0.374346 | 0.526599 | 0.404235 | 0.473472 | 0.279555 | 0.498785 | 0.18366 |
| 6 | 0.715458 | 0.308784 | 0.742676 | 0.711486 | 0.459589 | 0.179296 | 0.655836 | 0.682297 | 0.470768 | 0.645196 |
| 7 | 0.367267 | 0.153887 | 0.444649 | 0.806179 | 0.926605 | 0.374438 | 0.49657 | 0.504631 | 0.731201 | 0.652302 |
| 8 | 0.914172 | 0.693937 | 0.190177 | 0.282466 | 0.008858 | 0.375428 | 0.968578 | 0.363458 | 0.687028 | 0.891952 |
| 9 | 0.498294 | 0.079079 | 0.277786 | 0.646062 | 0.769159 | 0.039017 | 0.830641 | 0.124235 | 0.158045 | 0.367933 |
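
A caveat when subclassing `DataFrame`: many pandas operations build a new object for their result, and by default that object is a plain `DataFrame`, so custom methods such as `normalize_cols` may no longer be available on it. The pandas documentation on subclassing describes overriding the `_constructor` property to keep the subclass. The sketch below compares the two behaviours; the class names `PlainSub` and `KeepSub` are made up for illustration, and the exact behaviour may vary between pandas versions.

```python
import pandas as pd

class PlainSub(pd.DataFrame):
    # No override: results of operations fall back to plain DataFrame.
    pass

class KeepSub(pd.DataFrame):
    @property
    def _constructor(self):
        # Ask pandas to build results with this subclass.
        return KeepSub

data = {'a': [1, 2, 3]}

plain_head = PlainSub(data).head(2)
keep_head = KeepSub(data).head(2)

print(type(plain_head).__name__)
print(type(keep_head).__name__)
```

With a recent pandas version, the first line should print `DataFrame` and the second `KeepSub`.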
