# Handling null values

In [0]:
import pandas as pd
from io import StringIO
csv_data = \
  '''A,B,C,D
  1.0,2.0,3.0,4.0
  5.0,6.0,,8.0
  10.0,11.0,12.0,'''
# If you are using Python 2.7, you need
# to convert the string to unicode:
# csv_data = unicode(csv_data)
df = pd.read_csv(StringIO(csv_data))
df
# The code above reads CSV-formatted data into a pandas DataFrame via read_csv;
# the two missing cells are replaced by NaN.
# StringIO is used here only for illustration: it lets us read the string
# assigned to csv_data as if it were a regular CSV file on our hard drive.

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
1,5.0,6.0,,8.0
2,10.0,11.0,12.0,


In [0]:
df.isna().sum()

A    0
B    0
C    1
D    1
dtype: int64

In [0]:
# access the underlying NumPy array
df.values

array([[ 1.,  2.,  3.,  4.],
       [ 5.,  6., nan,  8.],
       [10., 11., 12., nan]])

In [0]:
# the "easiest" option: drop every row that contains a NaN
df.dropna(axis=0)

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0


In [0]:
# we can also drop the columns that contain a NaN
df.dropna(axis=1)

Unnamed: 0,A,B
0,1.0,2.0
1,5.0,6.0
2,10.0,11.0


In [0]:
# drop only the rows where all values are NaN
df.dropna(how='all')

In [0]:
# drop rows that have fewer than 4 non-NaN values
df.dropna(thresh=4)
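
The two cells above show no output, so here is a minimal sketch of how `how='all'` and `thresh` behave. The `toy` frame and the variable names are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: row 1 has a single real value, row 2 has none.
toy = pd.DataFrame({
    'A': [1.0, 5.0, np.nan],
    'B': [2.0, np.nan, np.nan],
    'C': [3.0, np.nan, np.nan],
})

# how='all' drops only rows in which every value is NaN (row 2 here).
only_all_nan_removed = toy.dropna(how='all')

# thresh=2 keeps rows with at least 2 non-NaN values (only row 0 here).
at_least_two_values = toy.dropna(thresh=2)

print(only_all_nan_removed.index.tolist())
print(at_least_two_values.index.tolist())
```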

In [0]:
# drop rows that have a NaN in a specific column
df.dropna(subset=['C'])

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
2,10.0,11.0,12.0,


In [0]:
# the percentage of null values per column, computed directly
# https://www.dunderdata.com/post/finding-the-percentage-of-missing-values-in-a-pandas-dataframe 
df.isna().mean().round(4) * 100

A     0.00
B     0.00
C    33.33
D    33.33
dtype: float64
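
The same idea can be applied per row instead of per column. The following is a small sketch that rebuilds the example frame so it runs on its own; the names `missing_per_row` and `rows_with_nan` are illustrative.

```python
from io import StringIO
import pandas as pd

# Same data as the example at the top of this notebook.
csv_data = 'A,B,C,D\n1.0,2.0,3.0,4.0\n5.0,6.0,,8.0\n10.0,11.0,12.0,'
df = pd.read_csv(StringIO(csv_data))

# Number of missing values in each row (axis=1 sums across the columns).
missing_per_row = df.isna().sum(axis=1)

# Keep only the rows that contain at least one NaN.
rows_with_nan = df[df.isna().any(axis=1)]

print(missing_per_row.tolist())
print(rows_with_nan.index.tolist())
```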

Often, removing samples or dropping entire feature columns is simply not
feasible, because we might lose too much valuable data. In this case, we can use
different interpolation techniques to estimate the missing values from the other
training samples in our dataset. One of the most common interpolation techniques
is mean imputation, where we replace each missing value with the mean of its
feature column. A convenient way to achieve this is the SimpleImputer class from
scikit-learn (the older sklearn.preprocessing.Imputer class was removed in
scikit-learn 0.22), as shown in the following code:

In [0]:
import numpy as np
from sklearn.impute import SimpleImputer
imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imr = imr.fit(df.values)
imputed_data = imr.transform(df.values)
imputed_data

Here, we replaced each NaN value with the corresponding mean, which is calculated
separately for each feature column. The removed Imputer class also accepted
axis=1 to use row means instead; SimpleImputer has no axis parameter and always
imputes column-wise. Other options for the strategy parameter are median or
most_frequent, where the latter replaces the missing values with the most frequent
values. This is useful for imputing categorical feature values, for example, a feature
column that stores an encoding of color names, such as red, green, and blue, and we
will encounter examples of such data later in this chapter.
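
As a rough illustration of the most_frequent strategy on a categorical column, the sketch below uses scikit-learn's SimpleImputer. The `colors` frame is made up for this example and is not part of the original data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical feature with one missing entry.
colors = pd.DataFrame({'color': ['red', 'green', np.nan, 'red']}, dtype=object)

# most_frequent replaces NaN with the most common value of each column ('red' here).
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
filled = imputer.fit_transform(colors)

print(filled.ravel().tolist())
```

Note that `fit_transform` returns a NumPy array rather than a DataFrame, so the column name is lost unless it is reattached.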

In [0]:
from sklearn.impute import SimpleImputer
my_imputer = SimpleImputer()
imputed_data = pd.DataFrame(my_imputer.fit_transform(df))
imputed_data

Unnamed: 0,0,1,2,3
0,1.0,2.0,3.0,4.0
1,5.0,6.0,7.5,8.0
2,10.0,11.0,12.0,6.0
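
The output above loses the original column names because the imputer returns a plain array. As an alternative sketch that stays in pandas (a different route from the scikit-learn imputer, not a replacement for it), `fillna` with the column means gives the same values and keeps the labels:

```python
from io import StringIO
import pandas as pd

# Same data as the example at the top of this notebook.
csv_data = 'A,B,C,D\n1.0,2.0,3.0,4.0\n5.0,6.0,,8.0\n10.0,11.0,12.0,'
df = pd.read_csv(StringIO(csv_data))

# df.mean() returns one mean per column, skipping NaN;
# fillna matches those means to the columns by name.
filled = df.fillna(df.mean())

print(filled)
```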


# Duplicate values

In [0]:
# pandas was already imported above as pd
import numpy as np

# Create a DataFrame
d = {
    'Name': ['Alisa', 'Bobby', 'jodha', 'jack', 'raghu', 'Cathrine',
             'Alisa', 'Bobby', 'kumar', 'Alisa', 'Alex', 'Cathrine'],
    'Age': [26, 24, 23, 22, 23, 24, 26, 24, 22, 23, 24, 24],
    'Score': [85, 63, 55, 74, 31, 77, 85, 63, 42, 62, 89, 77],
}

df = pd.DataFrame(d, columns=['Name', 'Age', 'Score'])
df

In [0]:
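# Flag each row that repeats an earlier row. assign() returns a copy, so df keeps
# only Name, Age and Score. Storing is_duplicate in df itself would make the
# repeated rows differ in that column, and drop_duplicates() would then drop nothing.
df.assign(is_duplicate=df.duplicated())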

In [0]:
# drop duplicate rows, keeping the first occurrence (the default)

df.drop_duplicates()

In [0]:
# drop duplicate rows, keeping the last occurrence

df.drop_duplicates(keep='last')
# With keep='last', the last row of each group of duplicates is kept and the earlier ones are dropped.

In [0]:
# Now drop duplicates based on a single column: rows are dropped so that
# each value in that column appears only once.

df.drop_duplicates(['Name'], keep='last')

# Here the Name column ends up containing only unique values; for each name, the last row is kept.
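
To make the effect of the `keep` argument concrete, here is a small self-contained sketch on a hypothetical three-row frame (`toy` is not part of the data above). Besides `'first'` and `'last'`, `keep=False` drops every row that has a duplicate.

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 2 are identical.
toy = pd.DataFrame({
    'Name': ['Alisa', 'Bobby', 'Alisa'],
    'Score': [85, 63, 85],
})

first_kept = toy.drop_duplicates(keep='first')  # keeps row 0, drops row 2
last_kept = toy.drop_duplicates(keep='last')    # keeps row 2, drops row 0
none_kept = toy.drop_duplicates(keep=False)     # drops both row 0 and row 2

print(first_kept.index.tolist())
print(last_kept.index.tolist())
print(none_kept.index.tolist())
```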

More examples: https://nbviewer.jupyter.org/github/pydata/pydata-book/blob/2nd-edition/ch07.ipynb