# Cleaning not-null values

After working with many datasets, I can tell you that "missing data" is rarely the hard part. The best case is when missing values show up clearly as `np.nan`: then all you need are methods like `isnull` and `fillna`/`dropna`, and pandas takes care of the rest.
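
As a minimal sketch of those three methods, consider the following. The Series `s` and its values are made up for illustration and are not part of the dataset used below.

```python
import numpy as np
import pandas as pd

# Illustrative data only (an assumption, not from the original dataset)
s = pd.Series([1.0, np.nan, 3.0, np.nan])

missing_mask = s.isnull()            # True wherever a value is missing
n_missing = int(missing_mask.sum())  # count of missing values
filled = s.fillna(0)                 # new Series with missing values replaced by 0
dropped = s.dropna()                 # new Series without the missing values

print(n_missing)
print(filled.tolist())
print(dropped.tolist())
```

Both `fillna` and `dropna` return new objects here, so `s` itself is left unchanged.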

But sometimes you have invalid values that are not simply "missing data" (`None` or `nan`). For example:

In [1]:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Sex': ['M', 'F', 'F', 'D', '?'],
    'Age': [29, 30, 24, 290, 25],
})
df

  Sex  Age
0   M   29
1   F   30
2   F   24
3   D  290
4   ?   25


In [5]:
df['Sex'].unique()  # list the distinct values; useful for categorical data
df['Sex'].value_counts()  # count the occurrences of each value
df['Sex'].replace({'D': 'M', 'N': 'M'})  # replace values; returns a new Series, df is unchanged

0    M
1    F
2    F
3    M
4    ?
Name: Sex, dtype: object
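
Before replacing anything, it can help to flag which rows hold invalid values. The sketch below is one possible way to do that, under two assumptions that do not come from the data itself: that only `'M'` and `'F'` are valid codes for `Sex`, and that any `Age` outside 0 to 100 is a data-entry error.

```python
import pandas as pd

df = pd.DataFrame({
    'Sex': ['M', 'F', 'F', 'D', '?'],
    'Age': [29, 30, 24, 290, 25],
})

# Assumption: 'M' and 'F' are the only valid codes
valid_sex = df['Sex'].isin(['M', 'F'])
# Assumption: ages outside 0..100 (inclusive) are errors
valid_age = df['Age'].between(0, 100)

invalid_rows = df[~(valid_sex & valid_age)]  # rows failing at least one check
clean = df[valid_sex & valid_age]            # rows passing both checks

print(invalid_rows)
print(clean)
```

Whether such rows should be dropped, corrected, or kept for review depends on the dataset; the masks only make the problem visible.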

### Duplicates

Checking for duplicate values is simple, although it behaves slightly differently for Series and DataFrames.

In [8]:
ambassadors = pd.Series([
    'France',
    'United Kingdom',
    'United Kingdom',
    'Italy',
    'Germany',
    'Germany',
    'Germany',
], index=[
    'Gérard Araud',
    'Kim Darroch',
    'Peter Westmacott',
    'Armando Varricchio',
    'Peter Wittig',
    'Peter Ammon',
    'Klaus Scharioth'
])
ambassadors

Gérard Araud                  France
Kim Darroch           United Kingdom
Peter Westmacott      United Kingdom
Armando Varricchio             Italy
Peter Wittig                 Germany
Peter Ammon                  Germany
Klaus Scharioth              Germany
dtype: object

In [19]:
ambassadors.duplicated()  # default keep='first': the first occurrence is False, later repeats are True
print(ambassadors.duplicated(keep=False))  # keep=False: every occurrence of a repeated value is True

# drop_duplicates() removes duplicates and accepts the same keep argument as duplicated()
ambassadors.drop_duplicates(keep='last')

Gérard Araud          False
Kim Darroch            True
Peter Westmacott       True
Armando Varricchio    False
Peter Wittig           True
Peter Ammon            True
Klaus Scharioth        True
dtype: bool


Gérard Araud                  France
Peter Westmacott      United Kingdom
Armando Varricchio             Italy
Klaus Scharioth              Germany
dtype: object

In [18]:
# working with DataFrames
players = pd.DataFrame({
    'Name': [
        'Kobe Bryant',
        'LeBron James',
        'Kobe Bryant',
        'Carmelo Anthony',
        'Kobe Bryant',
    ],
    'Pos': [
        'SG',
        'SF',
        'SG',
        'SF',
        'SF'
    ]
})

print(players)
print(players.duplicated(subset=['Name']))  # look for duplicates in a specific column only
players.drop_duplicates(subset=['Name'], keep='last')  # drop duplicates, keeping the last occurrence

              Name Pos
0      Kobe Bryant  SG
1     LeBron James  SF
2      Kobe Bryant  SG
3  Carmelo Anthony  SF
4      Kobe Bryant  SF
0    False
1    False
2     True
3    False
4     True
dtype: bool


              Name Pos
1     LeBron James  SF
3  Carmelo Anthony  SF
4      Kobe Bryant  SF
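
To make the Series/DataFrame difference concrete, here is a short sketch comparing `duplicated()` without arguments, which compares whole rows, against the `subset` form. It rebuilds the same `players` DataFrame so it can run on its own.

```python
import pandas as pd

players = pd.DataFrame({
    'Name': ['Kobe Bryant', 'LeBron James', 'Kobe Bryant',
             'Carmelo Anthony', 'Kobe Bryant'],
    'Pos': ['SG', 'SF', 'SG', 'SF', 'SF'],
})

# With no arguments, a row is a duplicate only if ALL columns match an earlier row
whole_row = players.duplicated()
# With subset, only the listed columns are compared
by_name = players.duplicated(subset=['Name'])

deduped = players.drop_duplicates()  # drops only fully identical rows

print(whole_row.tolist())
print(by_name.tolist())
print(deduped)
```

Row 4 (`Kobe Bryant`, `SF`) is kept by the whole-row check because its `Pos` differs from the earlier `Kobe Bryant` rows, but it is flagged when only `Name` is compared.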


## Text Handling
### Splitting Columns

The result of a survey is loaded and this is what you get:

In [24]:
df = pd.DataFrame({
    'Data': [
        '1987_M_US _1',
        '1990?_M_UK_1',
        '1992_F_US_2',
        '1970?_M_IT_1',
        '1985_F_IT_2'
]})

print(df)
df = df['Data'].str.split('_', expand=True)  # split on the separator and expand the pieces into a DataFrame
df.columns = ['Year', 'Sex', 'Country', 'No Children']  # assign column names
print(df)

           Data
0  1987_M_US _1
1  1990?_M_UK_1
2   1992_F_US_2
3  1970?_M_IT_1
4   1985_F_IT_2
    Year Sex Country No Children
0   1987   M     US            1
1  1990?   M      UK           1
2   1992   F      US           2
3  1970?   M      IT           1
4   1985   F      IT           2
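
The split columns still carry some noise: a stray space in `'US '` and a `?` after some years. One way to tidy them up, sketched below, uses the `.str` accessor. Treating `?` as an "uncertain year" marker and keeping it in a separate `uncertain_year` mask is an assumption made for illustration, not something the survey data states.

```python
import pandas as pd

df = pd.DataFrame({
    'Data': [
        '1987_M_US _1',
        '1990?_M_UK_1',
        '1992_F_US_2',
        '1970?_M_IT_1',
        '1985_F_IT_2',
    ]
})
df = df['Data'].str.split('_', expand=True)
df.columns = ['Year', 'Sex', 'Country', 'No Children']

# Remove leading/trailing whitespace ('US ' becomes 'US')
df['Country'] = df['Country'].str.strip()

# Assumption: '?' marks an uncertain year; record it before removing the character
uncertain_year = df['Year'].str.contains('?', regex=False)
df['Year'] = df['Year'].str.replace('?', '', regex=False).astype(int)

# The remaining numeric column can now be converted as well
df['No Children'] = df['No Children'].astype(int)

print(df)
print(uncertain_year.tolist())
```

Passing `regex=False` matters here because `?` is a special character in regular expressions.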
