# DataFrame Manipulation with `pandas`

`pandas` is often read as "Python Data Analysis"; the name also derives from "panel data".  
Because `pandas` holds the data it works on in RAM, it is a poor fit for datasets that do not fit in memory. 
That said, `pandas` is a library that lets you work with `DataFrames`: two-dimensional tables that are easy to manipulate and carry labels for both rows and columns. 
At the core of the ```pandas``` library are two fundamental data structures/objects:
1. [Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html)
2. [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)

A **```Series```** object stores single-column data along with an **index**. The index labels each element of the ```Series```; by default it is the integers 0, 1, 2, ..., but it can hold other labels as well.

A **```DataFrame```** object is a two-dimensional tabular data structure with labeled axes. 
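
As a minimal sketch of the two structures (the variable names and values below are illustrative, not taken from the datasets used later), a `Series` and a `DataFrame` can be built by hand like this:

```python
import pandas as pd

# A Series: one column of values plus an index (here, custom string labels).
prices = pd.Series([489.0, 501.0, 502.5], index=["a", "b", "c"], name="price")

# A DataFrame: several columns sharing the same row index.
watches = pd.DataFrame(
    {
        "model": ["Caracal", "Lightning bolt", "Sand"],
        "price": [489.0, 501.0, 502.5],
    }
)

# Selecting a single column of a DataFrame gives back a Series.
model_column = watches["model"]
print(type(model_column).__name__)  # Series
print(prices["b"])                  # 501.0
```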

## Loading Data into ```pandas```

The most common data format is the CSV file: plain text whose values are usually separated by commas, although other characters such as semicolons or tabs can also be used.
For example:  



Animal,Type,Colour  
Cow,Farm,White  
Cat,House,Black  
Koala,Wildlife,Grey  


~~~python
import pandas as pd
pd.read_csv("path/to/file.csv")
~~~
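
When the file uses a separator other than a comma, `read_csv` accepts it through the `sep` parameter. The sketch below assumes a semicolon-separated version of the animal table; `io.StringIO` stands in for a file on disk so the example runs on its own.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated content (not one of the tutorial's data files).
raw = "Animal;Type;Colour\nCow;Farm;White\nCat;House;Black\nKoala;Wildlife;Grey\n"

animals = pd.read_csv(io.StringIO(raw), sep=";")
print(animals.shape)          # (3, 3)
print(list(animals.columns))  # ['Animal', 'Type', 'Colour']
```

For tab-separated files the same call would use `sep="\t"`.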

In [14]:
import pandas as pd
df = pd.read_csv("Data/watches.csv")

### Visualising the DataFrame

We can check the first rows using [**`.head()`**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) and see the last ones with [**`.tail()`**](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.tail.html):

In [15]:
df.head()

Unnamed: 0,model,store,condition,engagement,price
0,Caracal,Watches unlimited,New,77.848101,489.0
1,Caracal,National traders,Like new,75.696203,489.0
2,Caracal,National traders,Good,72.025316,490.5
3,Lightning bolt,Super deals,Like new,78.987342,501.0
4,Sand,Super deals,Good,80.126582,502.5


### Accessing a single column

In [16]:
df['model']

0            Caracal
1            Caracal
2            Caracal
3     Lightning bolt
4               Sand
           ...      
70    Lightning bolt
71              Sand
72    Lightning bolt
73              Sand
74             Tempo
Name: model, Length: 75, dtype: object

### Accessing multiple columns

In [17]:
df[["model", "store", "condition"]]


Unnamed: 0,model,store,condition
0,Caracal,Watches unlimited,New
1,Caracal,National traders,Like new
2,Caracal,National traders,Good
3,Lightning bolt,Super deals,Like new
4,Sand,Super deals,Good
...,...,...,...
70,Lightning bolt,National traders,Very Good
71,Sand,National traders,Good
72,Lightning bolt,Watches unlimited,Fair
73,Sand,Super deals,Like new


### Indexing

By default, pandas labels the rows with an integer index starting at 0, but you can replace it with one or more columns, typically categorical (sometimes numerical) ones. Using several columns produces a hierarchical index.

In [18]:
df.set_index(["model", "store", "condition"])   # returns a new DataFrame; df itself is not modified

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,engagement,price
model,store,condition,Unnamed: 3_level_1,Unnamed: 4_level_1
Caracal,Watches unlimited,New,77.848101,489.0
Caracal,National traders,Like new,75.696203,489.0
Caracal,National traders,Good,72.025316,490.5
Lightning bolt,Super deals,Like new,78.987342,501.0
Sand,Super deals,Good,80.126582,502.5
...,...,...,...,...
Lightning bolt,National traders,Very Good,80.379747,4239.0
Sand,National traders,Good,80.506329,4282.5
Lightning bolt,Watches unlimited,Fair,67.088608,4284.0
Sand,Super deals,Like new,79.493671,835.5
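
Since `set_index` returns a new object, the result has to be assigned to a name if it is to be reused. A small sketch on a made-up three-column frame (`toy` and `indexed` are illustrative names):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "model": ["Caracal", "Sand"],
        "store": ["National traders", "Super deals"],
        "price": [489.0, 502.5],
    }
)

toy.set_index(["model", "store"])            # result is discarded; toy is unchanged
print(list(toy.columns))                     # ['model', 'store', 'price']

indexed = toy.set_index(["model", "store"])  # keep the result under a new name
print(list(indexed.index.names))             # ['model', 'store']
print(indexed.loc[("Sand", "Super deals"), "price"])  # 502.5
```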


### Getting rows

In [19]:
df.loc[1:3]    # label-based: selects index labels 1 to 3, end label included

Unnamed: 0,model,store,condition,engagement,price
1,Caracal,National traders,Like new,75.696203,489.0
2,Caracal,National traders,Good,72.025316,490.5
3,Lightning bolt,Super deals,Like new,78.987342,501.0


In [20]:
df.iloc[1:3]           # position-based: selects positions 1 and 2, end position excluded

Unnamed: 0,model,store,condition,engagement,price
1,Caracal,National traders,Like new,75.696203,489.0
2,Caracal,National traders,Good,72.025316,490.5
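
The difference between `.loc` and `.iloc` is easier to see when the index labels do not coincide with the row positions. The frame below is a hypothetical one with labels 10, 20, 30, 40:

```python
import pandas as pd

# Index labels (10, 20, 30, 40) differ from the positions 0..3.
toy = pd.DataFrame({"price": [489.0, 490.5, 501.0, 502.5]}, index=[10, 20, 30, 40])

by_label = toy.loc[20:30]     # labels 20 through 30, end label included
by_position = toy.iloc[1:3]   # positions 1 and 2, end position excluded

print(list(by_label.index))     # [20, 30]
print(list(by_position.index))  # [20, 30]
print(len(toy.loc[10:30]), len(toy.iloc[0:2]))  # 3 2
```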


In [21]:
df[2:10]   # we can also extract rows this way

Unnamed: 0,model,store,condition,engagement,price
2,Caracal,National traders,Good,72.025316,490.5
3,Lightning bolt,Super deals,Like new,78.987342,501.0
4,Sand,Super deals,Good,80.126582,502.5
5,Sand,National traders,Very Good,79.493671,504.0
6,Lightning bolt,Watches unlimited,Very Good,78.860759,504.0
7,Clepsydra,National traders,Very Good,78.35443,505.5
8,Caracal,Super deals,Fair,82.405063,505.5
9,Sand,Watches unlimited,New,79.493671,510.0


In [22]:
# we can filter rows with a boolean condition
df[df.price > 4000].tail()

Unnamed: 0,model,store,condition,engagement,price
69,Sand,Watches unlimited,Good,81.012658,4233.0
70,Lightning bolt,National traders,Very Good,80.379747,4239.0
71,Sand,National traders,Good,80.506329,4282.5
72,Lightning bolt,Watches unlimited,Fair,67.088608,4284.0
74,Tempo,Watches unlimited,Good,82.405063,4308.0
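
Filtering works in two steps: the comparison produces a boolean `Series`, and indexing with that `Series` keeps the rows where it is `True`. A sketch on a made-up frame:

```python
import pandas as pd

toy = pd.DataFrame(
    {"model": ["Caracal", "Sand", "Tempo"], "price": [489.0, 4282.5, 4308.0]}
)

mask = toy["price"] > 4000   # boolean Series aligned with toy's index
print(mask.tolist())          # [False, True, True]

expensive = toy[mask]         # keeps only the rows where the mask is True
print(expensive["model"].tolist())  # ['Sand', 'Tempo']
```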


### Displaying the index

In [23]:
df.index

RangeIndex(start=0, stop=75, step=1)

### Displaying the name of the variables

In [24]:
df.columns

Index(['model', 'store', 'condition', 'engagement', 'price'], dtype='object')

### Displaying the shape of the DataFrame

In [25]:
df.shape       # the first number is the number of rows and the second is the number of columns

(75, 5)

### Displaying the info of the variables of the DataFrame

In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75 entries, 0 to 74
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   model       75 non-null     object 
 1   store       75 non-null     object 
 2   condition   75 non-null     object 
 3   engagement  75 non-null     float64
 4   price       75 non-null     float64
dtypes: float64(2), object(3)
memory usage: 3.1+ KB


## Grouping the Data

Grouping is a common task in Data Science. You can group your data and then display summary statistics for the groups you have created, for example grouping by sex and counting how many men and women there are. 
Some of the functions you can apply are: 
 
-   **count()**	: Returns count for each group
-   **size()**	: Returns size for each group
-   **sum()**	: Returns total sum for each group
-   **mean()**	: Returns mean (average) for each group
-   **std()**	: Returns standard deviation for each group
-   **var()**	: Returns variance for each group
-   **sem()**	: Standard error of the mean of groups
-   **describe()**	: Returns different statistics
-   **min()**	: Returns minimum value for each group
-   **max()**	: Returns maximum value for each group
-   **first()**	: Returns first value for each group
-   **last()**	: Returns last value for each group
-   **nth()**	: Returns nth value for each group
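
A detail worth knowing is that `size()` counts all rows in a group, while `count()` counts only non-missing values. A short sketch with made-up data containing one `NaN`:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "model": ["Caracal", "Caracal", "Sand", "Sand"],
        "price": [489.0, np.nan, 502.5, 504.0],
    }
)
grouped = toy.groupby("model")["price"]

print(grouped.size().to_dict())   # {'Caracal': 2, 'Sand': 2}
print(grouped.count().to_dict())  # {'Caracal': 1, 'Sand': 2}
print(grouped.mean().to_dict())   # {'Caracal': 489.0, 'Sand': 503.25}
```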

In [27]:
groups = df.groupby(["model", "condition"])  # groups is a DataFrameGroupBy object
groups["price"].mean()  # select the "price" column of each group, then take its mean
# The result is a Series indexed by (model, condition)

model           condition
Caracal         Fair         2949.5
                Good         1744.0
                Like new      508.0
                New          1818.5
                Very Good    1832.5
Clepsydra       Fair         4151.5
                Good          604.0
                Like new     1934.0
                New          4162.5
                Very Good    1737.0
Lightning bolt  Fair         4235.5
                Good          655.0
                Like new     1739.5
                New           652.5
                Very Good    1783.5
Sand            Fair         4207.5
                Good         3006.0
                Like new     1884.5
                New           543.5
                Very Good     520.0
Tempo           Fair         4189.5
                Good         4225.5
                Like new     3039.5
                New          2965.0
                Very Good    1854.5
Name: price, dtype: float64

Another way of using `groupby`:

In [28]:
df.groupby(["model"]).agg({'price': ['mean', 'var', 'min', 'max']})  # group by model and show the mean, variance, min and max of the price
# The result is a regular DataFrame with two levels of column labels

Unnamed: 0_level_0,price,price,price,price
Unnamed: 0_level_1,mean,var,min,max
model,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
Caracal,1770.5,3048256.0,489.0,4207.5
Clepsydra,2517.8,3299022.0,505.5,4195.5
Lightning bolt,1813.2,3129486.0,501.0,4284.0
Sand,2032.3,3440414.0,502.5,4282.5
Tempo,3254.8,2551717.0,531.0,4308.0
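
The dictionary form above yields two levels of column labels. If flat column names are preferred, one option is named aggregation, where each keyword becomes an output column. The sketch uses made-up prices, and the output names (`mean_price` and so on) are arbitrary choices:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "model": ["Caracal", "Caracal", "Sand", "Sand"],
        "price": [489.0, 491.0, 502.5, 504.5],
    }
)

# Each keyword is output_name=(input_column, aggregation).
summary = toy.groupby("model").agg(
    mean_price=("price", "mean"),
    min_price=("price", "min"),
    max_price=("price", "max"),
)
print(list(summary.columns))                 # ['mean_price', 'min_price', 'max_price']
print(summary.loc["Caracal", "mean_price"])  # 490.0
```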


### Pivoting

Pivoting is similar to `groupby`, but the results are laid out differently: one or more variables are spread across the columns and one or more across the rows.  

We create pivot tables in `pandas` using the [**`.pivot_table()`**](https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html) method, with this syntax:

~~~python
pd.pivot_table(my_df, values=["numeric_column"], index=["row_variable"], columns=["column_variable"])
~~~

In [29]:
pd.pivot_table(df, values=["price"], index=["model"], columns=["condition"]) #by default is the mean

Unnamed: 0_level_0,price,price,price,price,price
condition,Fair,Good,Like new,New,Very Good
model,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Caracal,2949.5,1744.0,508.0,1818.5,1832.5
Clepsydra,4151.5,604.0,1934.0,4162.5,1737.0
Lightning bolt,4235.5,655.0,1739.5,652.5,1783.5
Sand,4207.5,3006.0,1884.5,543.5,520.0
Tempo,4189.5,4225.5,3039.5,2965.0,1854.5


In [30]:
pd.pivot_table(df, values=["price"], index=["model"], columns=["store"],aggfunc='count') 
#we can change to another aggregation function

Unnamed: 0_level_0,price,price,price
store,National traders,Super deals,Watches unlimited
model,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
Caracal,5,5,5
Clepsydra,5,5,5
Lightning bolt,5,5,5
Sand,5,5,5
Tempo,5,5,5
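
To make the link with `groupby` concrete, the sketch below (made-up data) builds the same table twice: once with `pivot_table`, and once by grouping on both variables, taking the mean and moving `condition` into the columns with `unstack()`.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "model": ["Caracal", "Caracal", "Sand", "Sand"],
        "condition": ["New", "Good", "New", "Good"],
        "price": [489.0, 490.5, 510.0, 502.5],
    }
)

pivoted = pd.pivot_table(toy, values="price", index="model", columns="condition")
grouped = toy.groupby(["model", "condition"])["price"].mean().unstack()

print(pivoted.loc["Sand", "New"])  # 510.0
print(grouped.loc["Sand", "New"])  # 510.0
```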


### stack() and unstack()

![Img](https://miro.medium.com/max/1400/1*DYDOif_qBEgtWfFKUDSf0Q.png)

In [31]:
df_index = df.set_index(["model", "store", "condition"])
df_index

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,engagement,price
model,store,condition,Unnamed: 3_level_1,Unnamed: 4_level_1
Caracal,Watches unlimited,New,77.848101,489.0
Caracal,National traders,Like new,75.696203,489.0
Caracal,National traders,Good,72.025316,490.5
Lightning bolt,Super deals,Like new,78.987342,501.0
Sand,Super deals,Good,80.126582,502.5
...,...,...,...,...
Lightning bolt,National traders,Very Good,80.379747,4239.0
Sand,National traders,Good,80.506329,4282.5
Lightning bolt,Watches unlimited,Fair,67.088608,4284.0
Sand,Super deals,Like new,79.493671,835.5


In [32]:
unstacked = df_index.unstack()   # moves the innermost index level (condition) into the column labels
unstacked

Unnamed: 0_level_0,Unnamed: 1_level_0,engagement,engagement,engagement,engagement,engagement,price,price,price,price,price
Unnamed: 0_level_1,condition,Fair,Good,Like new,New,Very Good,Fair,Good,Like new,New,Very Good
model,store,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2
Caracal,National traders,69.746835,72.025316,75.696203,79.113924,79.240506,4135.5,490.5,489.0,4135.5,831.0
Caracal,Super deals,82.405063,81.139241,76.202532,79.620253,80.759494,505.5,603.0,517.5,831.0,528.0
Caracal,Watches unlimited,74.303797,72.78481,77.088608,77.848101,78.987342,4207.5,4138.5,517.5,489.0,4138.5
Clepsydra,National traders,83.164557,80.759494,77.848101,77.468354,78.35443,4144.5,604.5,4140.0,4147.5,505.5
Clepsydra,Super deals,81.518987,81.012658,77.468354,78.481013,77.21519,4144.5,603.0,831.0,4144.5,529.5
Clepsydra,Watches unlimited,85.949367,79.873418,79.620253,77.974684,80.632911,4165.5,604.5,831.0,4195.5,4176.0
Lightning bolt,National traders,86.202532,80.0,79.113924,77.974684,80.379747,4216.5,607.5,532.5,831.0,4239.0
Lightning bolt,Super deals,85.189873,80.126582,78.987342,78.481013,79.240506,4206.0,526.5,501.0,522.0,607.5
Lightning bolt,Watches unlimited,67.088608,80.886076,74.43038,77.468354,78.860759,4284.0,831.0,4185.0,604.5,504.0
Sand,National traders,83.291139,80.506329,75.063291,78.734177,79.493671,4183.5,4282.5,607.5,516.0,504.0


In [33]:
unstacked.stack()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,engagement,price
model,store,condition,Unnamed: 3_level_1,Unnamed: 4_level_1
Caracal,National traders,Fair,69.746835,4135.5
Caracal,National traders,Good,72.025316,490.5
Caracal,National traders,Like new,75.696203,489.0
Caracal,National traders,New,79.113924,4135.5
Caracal,National traders,Very Good,79.240506,831.0
...,...,...,...,...
Tempo,Watches unlimited,Fair,69.746835,4173.0
Tempo,Watches unlimited,Good,82.405063,4308.0
Tempo,Watches unlimited,Like new,77.088608,4149.0
Tempo,Watches unlimited,New,78.354430,4149.0
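
The round trip is easier to follow on a tiny example. The frame below is made up and has a two-level index; `unstack()` widens it and `stack()` brings it back to the long layout.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "model": ["Caracal", "Caracal", "Sand", "Sand"],
        "condition": ["Good", "New", "Good", "New"],
        "price": [490.5, 489.0, 502.5, 510.0],
    }
).set_index(["model", "condition"])

wide = toy.unstack()   # innermost index level ("condition") moves to the columns
print(wide.shape)      # (2, 2)

long = wide.stack()    # innermost column level moves back into the row index
print(long.shape)      # (4, 1)
print(list(long.index.names))  # ['model', 'condition']
```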


# Practicing

In [34]:
titanic = pd.read_csv('Data/titanic.csv')


In [35]:
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [36]:
# Average age of men and women
titanic.groupby(['Sex']).agg({'Age':'mean'})

Unnamed: 0_level_0,Age
Sex,Unnamed: 1_level_1
female,27.915709
male,30.726645


In [37]:
# Average age of men and women, split by survival status
titanic.groupby(['Sex','Survived']).agg({'Age':'mean'})

Unnamed: 0_level_0,Unnamed: 1_level_0,Age
Sex,Survived,Unnamed: 2_level_1
female,0,25.046875
female,1,28.847716
male,0,31.618056
male,1,27.276022


In [38]:
# Average age of men and women, split by survival status, with a pivot table
pd.pivot_table(titanic,values = ['Age'] , index = ['Sex'] , columns= ['Survived'])

Unnamed: 0_level_0,Age,Age
Survived,0,1
Sex,Unnamed: 1_level_2,Unnamed: 2_level_2
female,25.046875,28.847716
male,31.618056,27.276022


In [39]:
# how many men and women survived, with pivot_table (counts non-missing Age values, so passengers without a recorded age are excluded)
pd.pivot_table(titanic,values = ['Age'] , index = ['Sex'] , columns= ['Survived'],aggfunc="count")

Unnamed: 0_level_0,Age,Age
Survived,0,1
Sex,Unnamed: 1_level_2,Unnamed: 2_level_2
female,64,197
male,360,93


In [40]:
# how many men and women survived, with groupby (again counting non-missing Age values)
titanic.groupby(['Sex','Survived']).agg({'Age':'count'})

Unnamed: 0_level_0,Unnamed: 1_level_0,Age
Sex,Survived,Unnamed: 2_level_1
female,0,64
female,1,197
male,0,360
male,1,93


In [41]:
# average age of men and women by survival status and travel class, with groupby
titanic.groupby(['Sex','Survived','Pclass']).agg({"Age":'mean'})

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Age
Sex,Survived,Pclass,Unnamed: 3_level_1
female,0,1,25.666667
female,0,2,36.0
female,0,3,23.818182
female,1,1,34.939024
female,1,2,28.080882
female,1,3,19.329787
male,0,1,44.581967
male,0,2,33.369048
male,0,3,27.255814
male,1,1,36.248


In [42]:
# average age of men and women by survival status and travel class, with pivot_table
pd.pivot_table(titanic , values= "Age" , columns= ['Survived'] , index=['Sex','Pclass'],aggfunc='mean')

Unnamed: 0_level_0,Survived,0,1
Sex,Pclass,Unnamed: 2_level_1,Unnamed: 3_level_1
female,1,25.666667,34.939024
female,2,36.0,28.080882
female,3,23.818182,19.329787
male,1,44.581967,36.248
male,2,33.369048,16.022
male,3,27.255814,22.274211


In [43]:
# Average ticket price of the ones who survived compared to the ones that didn't with groupby
titanic.groupby(['Survived']).agg({'Fare':'mean'})

Unnamed: 0_level_0,Fare
Survived,Unnamed: 1_level_1
0,22.117887
1,48.395408


In [44]:
# Average ticket price of the ones who survived compared to the ones that didn't with pivot_table
pd.pivot_table(data=titanic, values= 'Fare' , index='Survived',aggfunc='mean')

Unnamed: 0_level_0,Fare
Survived,Unnamed: 1_level_1
0,22.117887
1,48.395408


In [45]:
# mean ticket price by class, with pivot_table
pd.pivot_table(titanic, values= 'Fare' , index= 'Pclass', aggfunc = 'mean')

Unnamed: 0_level_0,Fare
Pclass,Unnamed: 1_level_1
1,84.154687
2,20.662183
3,13.67555


In [46]:
# mean ticket price by class, with groupby
titanic.groupby(['Pclass']).agg({'Fare':'mean'})

Unnamed: 0_level_0,Fare
Pclass,Unnamed: 1_level_1
1,84.154687
2,20.662183
3,13.67555


In [47]:
# mean and count of the ticket price by class and survival status, with groupby
titanic.groupby(['Pclass','Survived']).agg({'Fare':['mean','count']})

Unnamed: 0_level_0,Unnamed: 1_level_0,Fare,Fare
Unnamed: 0_level_1,Unnamed: 1_level_1,mean,count
Pclass,Survived,Unnamed: 2_level_2,Unnamed: 3_level_2
1,0,64.684007,80
1,1,95.608029,136
2,0,19.412328,97
2,1,22.0557,87
3,0,13.669364,372
3,1,13.694887,119


In [48]:
# mean and count of the ticket price by class and survival status, with pivot_table
pd.pivot_table(titanic, values= 'Fare' , index= 'Pclass', columns= 'Survived' , aggfunc = ['mean','count'])

Unnamed: 0_level_0,mean,mean,count,count
Survived,0,1,0,1
Pclass,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
1,64.684007,95.608029,80,136
2,19.412328,22.0557,97,87
3,13.669364,13.694887,372,119


# More of ``Pandas``
## methods and functions

We will work with a dataset of global electricity. Questions to solve are:

1. How much power is produced?
2. How much power is consumed?
3. How much power is imported and exported?
4. How much of this power is renewable?

In [49]:
import pandas as pd

In [50]:
df = pd.read_csv('Data/all_energy_statistics.csv')

In [51]:
df

Unnamed: 0,country_or_area,commodity_transaction,year,unit,quantity,quantity_footnotes,category
0,Austria,Additives and Oxygenates - Exports,1996,"Metric tons, thousand",5.0,,additives_and_oxygenates
1,Austria,Additives and Oxygenates - Exports,1995,"Metric tons, thousand",17.0,,additives_and_oxygenates
2,Belgium,Additives and Oxygenates - Exports,2014,"Metric tons, thousand",0.0,,additives_and_oxygenates
3,Belgium,Additives and Oxygenates - Exports,2013,"Metric tons, thousand",0.0,,additives_and_oxygenates
4,Belgium,Additives and Oxygenates - Exports,2012,"Metric tons, thousand",35.0,,additives_and_oxygenates
...,...,...,...,...,...,...,...
1189477,Viet Nam,Electricity - total wind production,2012,"Kilowatt-hours, million",92.0,1.0,wind_electricity
1189478,Viet Nam,Electricity - total wind production,2011,"Kilowatt-hours, million",87.0,,wind_electricity
1189479,Viet Nam,Electricity - total wind production,2010,"Kilowatt-hours, million",50.0,,wind_electricity
1189480,Viet Nam,Electricity - total wind production,2009,"Kilowatt-hours, million",10.0,,wind_electricity


In [52]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1189482 entries, 0 to 1189481
Data columns (total 7 columns):
 #   Column                 Non-Null Count    Dtype  
---  ------                 --------------    -----  
 0   country_or_area        1189482 non-null  object 
 1   commodity_transaction  1189482 non-null  object 
 2   year                   1189482 non-null  int64  
 3   unit                   1189482 non-null  object 
 4   quantity               1189482 non-null  float64
 5   quantity_footnotes     163946 non-null   float64
 6   category               1189482 non-null  object 
dtypes: float64(2), int64(1), object(4)
memory usage: 63.5+ MB


## How many unique values are in a given Series

In [53]:
df['country_or_area'].drop_duplicates()

0                                          Austria
2                                          Belgium
8                                          Czechia
10                                         Finland
26                                          France
                            ...                   
212765                                      Tuvalu
212938                    United States Virgin Is.
213088                       Wallis and Futuna Is.
362966    Commonwealth of Independent States (CIS)
399113                         Antarctic Fisheries
Name: country_or_area, Length: 243, dtype: object

In [54]:
df['category'].drop_duplicates()

0                                   additives_and_oxygenates
3018                                            animal_waste
4940                                              anthracite
9834                                       aviation_gasoline
28005                                                bagasse
                                 ...                        
1037653                                    total_electricity
1171569                                total_refinery_output
1177352                                              uranium
1178036    white_spirit_and_special_boiling_point_industr...
1188115                                     wind_electricity
Name: category, Length: 71, dtype: object
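
`drop_duplicates()` returns the distinct values themselves. When only the number of distinct values is needed, `nunique()` gives it directly, and `unique()` returns the values as an array. A sketch on a made-up `Series`:

```python
import pandas as pd

countries = pd.Series(["Austria", "Belgium", "Austria", "France", "Belgium"])

print(countries.nunique())               # 3
print(countries.unique().tolist())       # ['Austria', 'Belgium', 'France']
print(len(countries.drop_duplicates()))  # 3
```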

## Description of the **numerical** variables

In [55]:
df.describe() 

Unnamed: 0,year,quantity,quantity_footnotes
count,1189482.0,1189482.0,163946.0
mean,2002.852,184264.8,1.0
std,7.167345,15856630.0,0.0
min,1990.0,-864348.0,1.0
25%,1997.0,14.0,1.0
50%,2003.0,189.0,1.0
75%,2009.0,2265.0,1.0
max,2014.0,6680329000.0,1.0


## Description of the object variables

In [56]:
df.describe(include=['O'])        # includes the columns object type

Unnamed: 0,country_or_area,commodity_transaction,unit,category
count,1189482,1189482,1189482,1189482
unique,243,2452,6,71
top,Germany,From combustible fuels – Main activity,"Metric tons, thousand",total_electricity
freq,20422,6601,759859,133916


## `min()` and `max()` functions

In [57]:
print(df['year'].min())
print(df['year'].max())

1990
2014


## Changing `str` columns

In [58]:
df['commodity_transaction'].str.lower()

0           additives and oxygenates - exports
1           additives and oxygenates - exports
2           additives and oxygenates - exports
3           additives and oxygenates - exports
4           additives and oxygenates - exports
                          ...                 
1189477    electricity - total wind production
1189478    electricity - total wind production
1189479    electricity - total wind production
1189480    electricity - total wind production
1189481    electricity - total wind production
Name: commodity_transaction, Length: 1189482, dtype: object

# Dropping duplicated values

In [59]:
df["commodity_transaction"].drop_duplicates().str.count("-")
# drop_duplicates() removes repeated values, leaving one entry per unique label

0          1
149        1
617        1
906        1
1000       1
          ..
1186132    1
1187689    1
1187961    1
1188038    1
1188115    1
Name: commodity_transaction, Length: 2452, dtype: int64

# Counting string patterns

In [60]:
df["commodity_transaction"].str.count("-")

0          1
1          1
2          1
3          1
4          1
          ..
1189477    1
1189478    1
1189479    1
1189480    1
1189481    1
Name: commodity_transaction, Length: 1189482, dtype: int64

# What is the frequency of each category in a Series

In [61]:
df["commodity_transaction"].value_counts()

From combustible fuels – Main activity                 6601
Electricity - Gross demand                             5532
Electricity - net production                           5523
Electricity - total production, main activity          5523
Electricity - Gross production                         5523
                                                       ... 
Biodiesel - Net transfers                                 1
Lubricants - Consumption by wood and wood products        1
Refinery Gas - Consumption by mining and quarrying        1
Refinery Gas - Consumption by transport equipment         1
Lignite - Consumption by domestic navigation              1
Name: commodity_transaction, Length: 2452, dtype: int64
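
`value_counts()` can also report proportions instead of absolute counts through `normalize=True`. A sketch on a made-up `Series` of units:

```python
import pandas as pd

units = pd.Series(["tons", "tons", "kWh", "tons", "TJ"])

counts = units.value_counts()                # absolute frequencies, most common first
shares = units.value_counts(normalize=True)  # fractions that sum to 1

print(counts["tons"])  # 3
print(shares["tons"])  # 0.6
```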

Combining everything we just saw:  
104 unique values contain the word "Electricity" and 2348 do not.

In [62]:
df['commodity_transaction'].drop_duplicates().str.count('Electricity').value_counts()

0    2348
1     104
Name: commodity_transaction, dtype: int64

The next cell selects the unique values that do not contain an en dash (–):  

In [63]:
df['commodity_transaction'][df['commodity_transaction'].str.count('–')==0].drop_duplicates()  

0                         Additives and Oxygenates - Exports
149                       Additives and Oxygenates - Imports
617                    Additives and Oxygenates - Production
906        Additives and Oxygenates - Receipts from other...
1000                Additives and Oxygenates - Stock changes
                                 ...                        
1186132    White spirit and special boiling point industr...
1187689    White spirit and special boiling point industr...
1187961    White spirit and special boiling point industr...
1188038    White spirit and special boiling point industr...
1188115                  Electricity - total wind production
Name: commodity_transaction, Length: 2397, dtype: object

### Compound condition (filter)

In [64]:
df['commodity_transaction'][(df['commodity_transaction'].str.count('–')==0) & (df['commodity_transaction'].str.count('-')==0)].drop_duplicates()

1171569        Total refinery output
1174459    Total refinery throughput
Name: commodity_transaction, dtype: object
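
Compound filters combine boolean `Series` with `&` (and), `|` (or) and `~` (not). Naming the intermediate masks, as in this sketch on three made-up labels, can make the logic easier to read:

```python
import pandas as pd

labels = pd.Series(
    [
        "Additives and Oxygenates - Exports",
        "From combustible fuels – Main activity",
        "Total refinery output",
    ]
)

no_en_dash = labels.str.count("–") == 0
no_hyphen = labels.str.count("-") == 0

# When comparisons are written inline they need parentheses,
# because & binds more tightly than ==.
neither = labels[no_en_dash & no_hyphen]
either = labels[~no_en_dash | ~no_hyphen]

print(neither.tolist())  # ['Total refinery output']
print(len(either))       # 2
```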

### Replace values

In [65]:
df['clean_transaction'] = df["commodity_transaction"].str.replace("-" , "–").str.lower()

In [66]:
df.head()

Unnamed: 0,country_or_area,commodity_transaction,year,unit,quantity,quantity_footnotes,category,clean_transaction
0,Austria,Additives and Oxygenates - Exports,1996,"Metric tons, thousand",5.0,,additives_and_oxygenates,additives and oxygenates – exports
1,Austria,Additives and Oxygenates - Exports,1995,"Metric tons, thousand",17.0,,additives_and_oxygenates,additives and oxygenates – exports
2,Belgium,Additives and Oxygenates - Exports,2014,"Metric tons, thousand",0.0,,additives_and_oxygenates,additives and oxygenates – exports
3,Belgium,Additives and Oxygenates - Exports,2013,"Metric tons, thousand",0.0,,additives_and_oxygenates,additives and oxygenates – exports
4,Belgium,Additives and Oxygenates - Exports,2012,"Metric tons, thousand",35.0,,additives_and_oxygenates,additives and oxygenates – exports
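
Whether `str.replace` treats its pattern as a regular expression by default has changed between pandas versions, so passing `regex=False` makes the intent explicit when a literal substitution is wanted. A sketch on two made-up labels:

```python
import pandas as pd

labels = pd.Series(["Electricity - imports", "Fuel oil (net) - exports"])

# regex=False treats the pattern as a literal string, so characters such as
# "(" or "." need no escaping.
cleaned = labels.str.replace(" - ", " – ", regex=False).str.lower()
print(cleaned.tolist())
# ['electricity – imports', 'fuel oil (net) – exports']
```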


## Extracting rows

In [67]:
df["clean_transaction"][df["clean_transaction"].str.contains("import")].drop_duplicates()

149                       additives and oxygenates – imports
7831                                    anthracite – imports
19202                            aviation gasoline – imports
43469                                    biodiesel – imports
50278                                     biogases – imports
58332                                  biogasoline – imports
65415                                      bitumen – imports
100684                       brown coal briquettes – imports
113746                                  brown coal – imports
140375                                    charcoal – imports
151396                                    coal tar – imports
166623                                 coking coal – imports
174196                      conventional crude oil – imports
255441                                      ethane – imports
297781                                    fuel oil – imports
362577                                    fuelwood – imports
384982                  

In [None]:
df[df["clean_transaction"].str.contains('electricity – imports')]

Unnamed: 0,country_or_area,commodity_transaction,year,unit,quantity,quantity_footnotes,category,clean_transaction
1108326,Afghanistan,Electricity - imports,2014,"Kilowatt-hours, million",3710.8,,total_electricity,electricity – imports
1108327,Afghanistan,Electricity - imports,2013,"Kilowatt-hours, million",3615.2,,total_electricity,electricity – imports
1108328,Afghanistan,Electricity - imports,2012,"Kilowatt-hours, million",3071.0,,total_electricity,electricity – imports
1108329,Afghanistan,Electricity - imports,2011,"Kilowatt-hours, million",2732.0,,total_electricity,electricity – imports
1108330,Afghanistan,Electricity - imports,2010,"Kilowatt-hours, million",1867.0,,total_electricity,electricity – imports
...,...,...,...,...,...,...,...,...
1110998,Zimbabwe,Electricity - imports,1994,"Kilowatt-hours, million",2073.0,,total_electricity,electricity – imports
1110999,Zimbabwe,Electricity - imports,1993,"Kilowatt-hours, million",1921.0,,total_electricity,electricity – imports
1111000,Zimbabwe,Electricity - imports,1992,"Kilowatt-hours, million",1201.0,,total_electricity,electricity – imports
1111001,Zimbabwe,Electricity - imports,1991,"Kilowatt-hours, million",1838.0,,total_electricity,electricity – imports


## Pivoting categorical values

In [None]:
keep_values =  [
        "Electricity - Gross demand",
        "Electricity - Gross production",
        "Electricity - imports",
        "Electricity - exports",
        "Electricity - total hydro production",
        "Electricity - total wind production",
        "Electricity - total solar production",
        "Electricity - total geothermal production",
        "Electricity - total tide, wave production",
]


In [None]:
df_filtered = df[df['commodity_transaction'].isin(keep_values)]
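
`isin` returns a boolean mask that is `True` where the value appears in the given list; `~` inverts it. A sketch on a made-up three-row frame (`keep` is an illustrative list, shorter than `keep_values`):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "commodity_transaction": [
            "Electricity - imports",
            "Electricity - exports",
            "Fuel oil - imports",
        ],
        "quantity": [10.0, 4.0, 7.0],
    }
)
keep = ["Electricity - imports", "Electricity - exports"]

mask = toy["commodity_transaction"].isin(keep)
kept = toy[mask]       # rows whose value appears in the list
dropped = toy[~mask]   # ~ inverts the mask

print(len(kept), len(dropped))  # 2 1
```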

In [None]:
df_filtered.head()

Unnamed: 0,country_or_area,commodity_transaction,year,unit,quantity,quantity_footnotes,category,clean_transaction
490912,Australia,Electricity - total geothermal production,2014,"Kilowatt-hours, million",1.0,,geothermal,electricity – total geothermal production
490913,Australia,Electricity - total geothermal production,2013,"Kilowatt-hours, million",1.0,,geothermal,electricity – total geothermal production
490914,Australia,Electricity - total geothermal production,2012,"Kilowatt-hours, million",1.0,,geothermal,electricity – total geothermal production
490915,Australia,Electricity - total geothermal production,2011,"Kilowatt-hours, million",1.0,,geothermal,electricity – total geothermal production
490916,Australia,Electricity - total geothermal production,2010,"Kilowatt-hours, million",1.0,,geothermal,electricity – total geothermal production


In [None]:
df_countries = pd.pivot_table(
    df_filtered,
    values="quantity",
    index=["country_or_area", "year"],
    columns="commodity_transaction",
)
df_countries

Unnamed: 0_level_0,commodity_transaction,Electricity - Gross demand,Electricity - Gross production,Electricity - exports,Electricity - imports,Electricity - total geothermal production,Electricity - total hydro production,Electricity - total solar production,"Electricity - total tide, wave production",Electricity - total wind production
country_or_area,year,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
Afghanistan,1990,1055.0,1128.0,,,,764.0,,,
Afghanistan,1991,945.0,1015.0,,,,690.0,,,
Afghanistan,1992,789.0,703.0,,131.0,,478.0,,,
Afghanistan,1993,780.0,695.0,,130.0,,475.0,,,
Afghanistan,1994,770.0,687.0,,128.0,,472.0,,,
...,...,...,...,...,...,...,...,...,...,...
Zimbabwe,2010,9317.3,8602.9,694.4,1681.7,,5762.8,,,
Zimbabwe,2011,9645.5,9177.2,988.2,1578.7,,5201.8,,,
Zimbabwe,2012,9425.2,9148.6,700.9,1076.1,,5387.3,,,
Zimbabwe,2013,9919.7,9498.8,1189.3,1722.0,,4981.8,,,


### Renaming the columns

In [None]:
df_countries.columns = [
    "demand",
    "production",
    "exports",
    "imports",
    "geothermal",
    "hydro",
    "solar",
    "tide",
    "wind",
]
df_countries

Unnamed: 0_level_0,Unnamed: 1_level_0,demand,production,exports,imports,geothermal,hydro,solar,tide,wind
country_or_area,year,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
Afghanistan,1990,1055.0,1128.0,,,,764.0,,,
Afghanistan,1991,945.0,1015.0,,,,690.0,,,
Afghanistan,1992,789.0,703.0,,131.0,,478.0,,,
Afghanistan,1993,780.0,695.0,,130.0,,475.0,,,
Afghanistan,1994,770.0,687.0,,128.0,,472.0,,,
...,...,...,...,...,...,...,...,...,...,...
Zimbabwe,2010,9317.3,8602.9,694.4,1681.7,,5762.8,,,
Zimbabwe,2011,9645.5,9177.2,988.2,1578.7,,5201.8,,,
Zimbabwe,2012,9425.2,9148.6,700.9,1076.1,,5387.3,,,
Zimbabwe,2013,9919.7,9498.8,1189.3,1722.0,,4981.8,,,
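
Assigning a list to `.columns` works only if the list is in exactly the same order as the existing columns. A possible alternative, sketched here on a two-column toy frame, is `rename` with an explicit old-to-new mapping, so each name is matched by label rather than by position. The names `wide_df`, `short_names` and `renamed` are placeholders for this example.

```python
import pandas as pd

# Small stand-in for the pivoted table (two of the original column names)
wide_df = pd.DataFrame(
    [[1055.0, 1128.0], [945.0, 1015.0]],
    columns=["Electricity - Gross demand", "Electricity - Gross production"],
)

# Map each long name to its short replacement explicitly
short_names = {
    "Electricity - Gross demand": "demand",
    "Electricity - Gross production": "production",
}

# rename returns a new DataFrame; wide_df keeps its original names
renamed = wide_df.rename(columns=short_names)
print(renamed.columns.tolist())
```

Labels missing from the mapping are left unchanged, which makes a partial rename safe.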


In [None]:
df_countries.sort_values(by='production',ascending=False)

country_or_area,year,demand,production,exports,imports,geothermal,hydro,solar,tide,wind
China,2014,5219096.0,5649583.4,18158.0,6750.0,,1064337.0,15189.0,,156078.0
China,2013,5016127.0,5431637.4,18669.0,7438.0,,920291.0,5564.0,,141197.0
China,2012,4609729.0,4987553.0,17653.0,6874.0,,872107.0,,,95978.0
China,2011,4319132.0,4713019.0,19307.0,6562.0,,698945.0,,,70331.0
United States,2010,4153664.0,4378422.0,19107.0,45083.0,17577.0,286333.0,3934.0,,95148.0
...,...,...,...,...,...,...,...,...,...,...
Lesotho,1994,310.0,,,310.0,,,,,
Lesotho,1995,324.0,,,324.0,,,,,
Lesotho,1996,335.0,,,335.0,,,,,
Lesotho,1997,395.0,,,395.0,,,,,


# Reset the explicit index

In [None]:
df_countries = df_countries.reset_index()
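
To make the effect of `reset_index()` concrete, here is a minimal sketch on a three-row frame with the same two index levels. The names `indexed` and `flat` are illustrative. The index levels become ordinary columns and a default integer index takes their place.

```python
import pandas as pd

# A frame indexed by (country_or_area, year), like the pivoted table
indexed = pd.DataFrame(
    {"demand": [1055.0, 945.0, 9317.3]},
    index=pd.MultiIndex.from_tuples(
        [("Afghanistan", 1990), ("Afghanistan", 1991), ("Zimbabwe", 2010)],
        names=["country_or_area", "year"],
    ),
)

# Move both index levels back into regular columns
flat = indexed.reset_index()
print(flat)
```

After this step `flat['year']` is a normal column, which is why the `value_counts()` call on `year` works.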

In [None]:
df_countries['year'].value_counts()

2014    229
2013    229
2012    229
2011    226
2010    226
2009    226
2008    226
2007    226
2006    225
2005    225
2003    224
2004    224
2002    224
2001    221
2000    220
1999    220
1998    220
1997    220
1996    220
1995    220
1994    220
1993    219
1992    219
1991    197
1990    197
Name: year, dtype: int64
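
`value_counts()` orders its result by count, not by label, which is why the years above are not strictly chronological. If chronological order is preferred, one option is to chain `sort_index()`. This is sketched on a made-up series named `years`.

```python
import pandas as pd

# Toy year column (invented values)
years = pd.Series([2014, 2014, 2013, 1990, 2013, 2014], name="year")

counts = years.value_counts()   # most frequent year first
by_year = counts.sort_index()   # reordered by the year label instead
print(by_year)
```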

# Fill NaN with 0

In [None]:
df_countries = df_countries.fillna(0)
df_countries

Unnamed: 0,country_or_area,year,demand,production,exports,imports,geothermal,hydro,solar,tide,wind
0,Afghanistan,1990,1055.0,1128.0,0.0,0.0,0.0,764.0,0.0,0.0,0.0
1,Afghanistan,1991,945.0,1015.0,0.0,0.0,0.0,690.0,0.0,0.0,0.0
2,Afghanistan,1992,789.0,703.0,0.0,131.0,0.0,478.0,0.0,0.0,0.0
3,Afghanistan,1993,780.0,695.0,0.0,130.0,0.0,475.0,0.0,0.0,0.0
4,Afghanistan,1994,770.0,687.0,0.0,128.0,0.0,472.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
5527,Zimbabwe,2010,9317.3,8602.9,694.4,1681.7,0.0,5762.8,0.0,0.0,0.0
5528,Zimbabwe,2011,9645.5,9177.2,988.2,1578.7,0.0,5201.8,0.0,0.0,0.0
5529,Zimbabwe,2012,9425.2,9148.6,700.9,1076.1,0.0,5387.3,0.0,0.0,0.0
5530,Zimbabwe,2013,9919.7,9498.8,1189.3,1722.0,0.0,4981.8,0.0,0.0,0.0
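
Replacing every `NaN` with 0 treats "not reported" as "zero", which may or may not be appropriate for a given column. The sketch below, on a toy frame with hypothetical names, first counts the missing values per column and then compares filling everything with filling a single column through a dictionary.

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing values (invented for illustration)
df = pd.DataFrame({
    "exports": [np.nan, 694.4],
    "imports": [131.0, np.nan],
    "hydro": [478.0, 5762.8],
})

n_missing = df.isna().sum()               # missing values per column
filled_all = df.fillna(0)                 # every NaN becomes 0
filled_some = df.fillna({"exports": 0})   # only 'exports' is filled
print(n_missing)
```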


Loading a new DataFrame that has no NaN values

In [120]:
import pandas as pd
df_countries = pd.read_csv("Data/df_countries_no_na.csv")

In [121]:
df_countries.head()

Unnamed: 0,country_or_area,year,demand,production,exports,imports,geothermal,hydro,solar,tide,wind
0,China,2014,5219096.0,5649583.4,18158.0,6750.0,0.0,1064337.0,15189.0,0,156078.0
1,China,2013,5016127.0,5431637.4,18669.0,7438.0,0.0,920291.0,5564.0,0,141197.0
2,China,2012,4609729.0,4987553.0,17653.0,6874.0,0.0,872107.0,0.0,0,95978.0
3,China,2011,4319132.0,4713019.0,19307.0,6562.0,0.0,698945.0,0.0,0,70331.0
4,United States,2010,4153664.0,4378422.0,19107.0,45083.0,17577.0,286333.0,3934.0,0,95148.0


# Sum values of different columns


In [122]:
df_countries['renewable_total'] = df_countries[['hydro', 'wind','solar','geothermal','tide']].sum(axis="columns")
df_countries['renewable_percent'] = df_countries['renewable_total'] / df_countries['production']

In [123]:
df_countries

Unnamed: 0,country_or_area,year,demand,production,exports,imports,geothermal,hydro,solar,tide,wind,renewable_total,renewable_percent
0,China,2014,5219096.0,5649583.4,18158.0,6750.0,0.0,1064337.0,15189.0,0,156078.0,1235604.0,0.218707
1,China,2013,5016127.0,5431637.4,18669.0,7438.0,0.0,920291.0,5564.0,0,141197.0,1067052.0,0.196451
2,China,2012,4609729.0,4987553.0,17653.0,6874.0,0.0,872107.0,0.0,0,95978.0,968085.0,0.194100
3,China,2011,4319132.0,4713019.0,19307.0,6562.0,0.0,698945.0,0.0,0,70331.0,769276.0,0.163224
4,United States,2010,4153664.0,4378422.0,19107.0,45083.0,17577.0,286333.0,3934.0,0,95148.0,402992.0,0.092040
...,...,...,...,...,...,...,...,...,...,...,...,...,...
5527,Lesotho,1994,310.0,0.0,0.0,310.0,0.0,0.0,0.0,0,0.0,0.0,
5528,Lesotho,1995,324.0,0.0,0.0,324.0,0.0,0.0,0.0,0,0.0,0.0,
5529,Lesotho,1996,335.0,0.0,0.0,335.0,0.0,0.0,0.0,0,0.0,0.0,
5530,Lesotho,1997,395.0,0.0,0.0,395.0,0.0,0.0,0.0,0,0.0,0.0,
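
Two details of the output above are worth spelling out. `renewable_percent` is stored as a fraction between 0 and 1 rather than a percentage. Rows where `production` is 0 end up as `NaN` because 0 divided by 0 is undefined. The sketch below reproduces both on a two-row toy frame. Converting zero production to `NaN` before dividing is an assumption about how one might want to handle that case, not something the original notebook does.

```python
import numpy as np
import pandas as pd

# Toy frame: one normal row and one row with zero production
df = pd.DataFrame({
    "production": [100.0, 0.0],
    "hydro": [40.0, 0.0],
    "wind": [10.0, 0.0],
})

# Row-wise sum across the renewable columns
df["renewable_total"] = df[["hydro", "wind"]].sum(axis="columns")

# Treat zero production as missing so the division never hits 0 / 0 or x / 0
safe_production = df["production"].replace(0, np.nan)
df["renewable_percent"] = df["renewable_total"] / safe_production
print(df)
```

Multiplying the result by 100 would give an actual percentage.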


In [124]:
df_countries[df_countries['year']==2014].sort_values(by='renewable_percent',ascending=False).head(5)

Unnamed: 0,country_or_area,year,demand,production,exports,imports,geothermal,hydro,solar,tide,wind,renewable_total,renewable_percent
2655,Albania,2014,7791.43,4724.43,183.45,3250.45,0.0,4724.43,0.0,0,0.0,4724.43,1.0
3924,Lesotho,2014,783.48,515.2,2.92,271.2,0.0,515.2,0.0,0,0.0,515.2,1.0
2357,Bhutan,2014,2085.46,7003.86,4991.9,187.37,0.0,7003.36,0.0,0,0.0,7003.36,0.999929
1008,Paraguay,2014,13432.0,55282.3,41400.1,0.0,0.0,55276.4,0.0,0,0.0,55276.4,0.999893
1704,Iceland,2014,17475.0,18122.0,0.0,0.0,5238.0,12873.0,0.0,0,8.0,18119.0,0.999834


In [125]:
# get only the top producers and redo the analysis
threshold = df_countries['production'].quantile(0.9)
df_countries[(df_countries.production > threshold) & (df_countries.year == 2014)].sort_values(by='renewable_percent',ascending=False).head(5)

Unnamed: 0,country_or_area,year,demand,production,exports,imports,geothermal,hydro,solar,tide,wind,renewable_total,renewable_percent
536,Norway,2014,124139.0,142327.0,21932.0,6347.0,0.0,136636.0,0.0,0,2216.0,138852.0,0.975584
138,Brazil,2014,615629.0,590541.0,3.0,33778.0,0.0,373439.0,16.0,0,12211.0,385666.0,0.653072
111,Canada,2014,591137.0,656225.0,58421.0,12808.0,0.0,382574.0,1756.0,16,22538.0,406884.0,0.620037
493,Sweden,2014,132375.0,153662.0,29475.0,13852.0,0.0,63872.0,47.0,0,11234.0,75153.0,0.48908
522,Viet Nam,2014,141136.0,145730.0,880.0,2053.0,0.0,61480.0,0.0,0,300.0,61780.0,0.423935
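
Note that the threshold in the cell above is computed over all years, so it is the 90th percentile of every country-year row. A possible variation, sketched here on invented data with a 0.5 quantile to keep the example small, is to filter to one year first and compute the quantile within that year. The names `df_2014`, `threshold_2014` and `top_2014` are placeholders.

```python
import pandas as pd

# Toy data for two years (invented values)
df = pd.DataFrame({
    "country_or_area": ["A", "B", "C", "D", "A", "B", "C", "D"],
    "year": [2013, 2013, 2013, 2013, 2014, 2014, 2014, 2014],
    "production": [100.0, 200.0, 300.0, 400.0, 150.0, 250.0, 350.0, 450.0],
    "renewable_percent": [0.1, 0.5, 0.3, 0.9, 0.2, 0.6, 0.4, 0.8],
})

# Restrict to one year, then compute the cut-off inside that year only
df_2014 = df[df["year"] == 2014]
threshold_2014 = df_2014["production"].quantile(0.5)

# Keep the producers above the cut-off, highest renewable share first
top_2014 = (
    df_2014[df_2014["production"] > threshold_2014]
    .sort_values(by="renewable_percent", ascending=False)
)
print(top_2014)
```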


# Of the countries with the highest production in 2014, show which had the highest renewable percentage (above) and the lowest (below)


In [126]:
df_countries[(df_countries.production > threshold) & (df_countries.year == 2014)].sort_values(by='renewable_percent',ascending=True).head(5)

Unnamed: 0,country_or_area,year,demand,production,exports,imports,geothermal,hydro,solar,tide,wind,renewable_total,renewable_percent
258,Saudi Arabia,2014,304240.0,311806.0,0.0,0.0,0.0,0.0,1.0,0,0.0,1.0,3e-06
170,"Korea, Republic of",2014,523363.0,550933.0,0.0,0.0,0.0,7820.0,2557.0,492,1146.0,12015.0,0.021808
316,South Africa,2014,231445.0,252578.0,13836.0,11177.0,0.0,4082.0,1120.0,0,1070.0,6272.0,0.024832
304,Other Asia,2014,244755.0,260025.0,0.0,0.0,0.0,7439.0,552.0,0,1500.0,9491.0,0.0365
429,Thailand,2014,179330.0,180862.0,2066.0,12260.0,1.0,5540.0,1385.0,0,305.0,7231.0,0.039981


In [127]:
df_countries.sort_values(by="renewable_percent",ascending=False,inplace=True)      # inplace=True sorts df_countries itself, so the result does not need to be reassigned
df_countries

Unnamed: 0,country_or_area,year,demand,production,exports,imports,geothermal,hydro,solar,tide,wind,renewable_total,renewable_percent
3187,Bhutan,1998,446.0,1801.0,1357.0,8.0,0.0,1801.0,0.0,0,0.0,1801.0,1.0
4042,Lesotho,2004,413.2,403.9,6.1,15.4,0.0,403.9,0.0,0,0.0,403.9,1.0
3123,Bhutan,1996,413.0,1972.0,1560.0,7.0,0.0,1972.0,0.0,0,0.0,1972.0,1.0
3935,Lesotho,2008,576.6,503.4,3.8,77.0,0.0,503.4,0.0,0,0.0,503.4,1.0
3923,Lesotho,2013,798.0,515.3,2.2,284.9,0.0,515.3,0.0,0,0.0,515.3,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
5527,Lesotho,1994,310.0,0.0,0.0,310.0,0.0,0.0,0.0,0,0.0,0.0,
5528,Lesotho,1995,324.0,0.0,0.0,324.0,0.0,0.0,0.0,0,0.0,0.0,
5529,Lesotho,1996,335.0,0.0,0.0,335.0,0.0,0.0,0.0,0,0.0,0.0,
5530,Lesotho,1997,395.0,0.0,0.0,395.0,0.0,0.0,0.0,0,0.0,0.0,
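
The difference between sorting in place and sorting into a new object can be seen on a small example. The frame and the names `sorted_copy` and `result` are made up for this sketch. It also shows why the rows with a missing `renewable_percent` sit at the bottom of the output above: `sort_values` puts `NaN` last by default.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "country_or_area": ["A", "B", "C"],
    "renewable_percent": [0.2, np.nan, 1.0],
})

# Without inplace: a sorted copy is returned and df is left untouched
sorted_copy = df.sort_values(by="renewable_percent", ascending=False)

# With inplace=True: df itself is reordered and the call returns None
result = df.sort_values(by="renewable_percent", ascending=False, inplace=True)
print(result)
print(df)
```

Reassigning the returned copy, as in `df = df.sort_values(...)`, has the same end effect and is often easier to chain with other methods.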


In [129]:
renewable_change = pd.pivot_table(df_countries , values= 'renewable_percent' , index = 'country_or_area' , columns= "year")
renewable_change

country_or_area,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,...,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014
Afghanistan,0.677305,0.679803,0.679943,0.683453,0.687045,0.690370,0.703704,0.723881,0.744361,0.737226,...,0.740618,0.707558,0.720000,0.686548,0.871766,0.859865,0.824876,0.859100,0.786364,0.853235
Albania,0.876134,0.950693,0.958892,0.951206,0.966180,0.952424,0.966250,0.969907,0.970994,0.979062,...,0.987139,0.983164,0.975560,1.000000,0.999834,0.999888,0.985939,1.000000,1.000000,1.000000
Algeria,0.008383,0.016892,0.010883,0.018182,0.008347,0.009790,0.006294,0.003490,0.004235,0.004056,...,0.016882,0.006189,0.006076,0.007034,0.007948,0.003805,0.009800,0.010837,0.005510,0.003954
American Samoa,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.007024,0.007031,0.007009
Andorra,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,...,0.988235,0.945946,0.750000,0.800508,0.765432,0.887311,0.865132,0.866894,0.887533,0.894322
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Yemen Arab Rep. (former),0.000000,,,,,,,,,,...,,,,,,,,,,
"Yemen, Dem. (former)",0.000000,,,,,,,,,,...,,,,,,,,,,
"Yugoslavia, SFR (former)",0.242374,0.242387,,,,,,,,,...,,,,,,,,,,
Zambia,0.994853,0.994855,0.994859,0.994605,0.994605,0.993311,0.992051,0.993074,0.991582,0.993302,...,0.994069,0.998657,0.998756,0.998741,0.998787,0.998756,0.998695,0.998545,0.998571,0.971630


In [134]:
renewable_change = pd.pivot_table(df_countries , values= 'renewable_percent' , index = 'country_or_area' , columns= "year").reset_index()[['country_or_area',1990,2014]]
renewable_change

year,country_or_area,1990,2014
0,Afghanistan,0.677305,0.853235
1,Albania,0.876134,1.000000
2,Algeria,0.008383,0.003954
3,American Samoa,0.000000,0.007009
4,Andorra,1.000000,0.894322
...,...,...,...
236,Yemen Arab Rep. (former),0.000000,
237,"Yemen, Dem. (former)",0.000000,
238,"Yugoslavia, SFR (former)",0.242374,
239,Zambia,0.994853,0.971630


In [139]:
renewable_change["diff"] = renewable_change[2014] - renewable_change[1990]
renewable_change.sort_values(by="diff",ascending=False)

year,country_or_area,1990,2014,diff
86,Greenland,0.000000,0.683475,0.683475
185,Sierra Leone,0.000000,0.653569,0.653569
75,French Guiana,0.000000,0.605495,0.605495
20,Belize,0.000000,0.507055,0.507055
58,Denmark,0.024555,0.425380,0.400824
...,...,...,...,...
234,Wallis and Futuna Is.,,0.000000,
235,Yemen,,0.000000,
236,Yemen Arab Rep. (former),0.000000,,
237,"Yemen, Dem. (former)",0.000000,,
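
Countries that lack a value for either 1990 or 2014 get `NaN` in `diff` and are pushed to the end of the sorted table. If only countries with both years should be compared, one option is to drop the incomplete rows first. This is sketched on a toy frame where `change`, `complete` and `ranked` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy version of the two-year table (invented values; integer column labels)
change = pd.DataFrame({
    "country_or_area": ["A", "B", "C"],
    1990: [0.0, 0.5, 0.2],
    2014: [0.6, 0.4, np.nan],
})

# Keep only rows that have both years, then compute the change
complete = change.dropna(subset=[1990, 2014]).copy()
complete["diff"] = complete[2014] - complete[1990]
ranked = complete.sort_values(by="diff", ascending=False)
print(ranked)
```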


# `apply()` function in pandas

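`apply()` runs a function on each element of a `Series`. On a `DataFrame`, it passes each column (`axis='index'`) or each row (`axis='columns'`) to the function.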

~~~python
my_series.apply(my_function)
~~~

In [140]:
def assign_label(value):
    # Map a numeric value to one of four labelled ranges
    if value < 500:
        label = "Less than 500"
    elif value < 5000:
        label = "Between 500 and 5,000"
    elif value < 50000:
        label = "Between 5,000 and 50,000"
    else:
        label = "50,000 or more"
    return label

In [146]:
assign_label(40000)

'Between 5,000 and 50,000'

In [141]:
df_countries_2014 = df_countries[df_countries.year == 2014]
df_countries_2014

Unnamed: 0,country_or_area,year,demand,production,exports,imports,geothermal,hydro,solar,tide,wind,renewable_total,renewable_percent
3924,Lesotho,2014,783.48,515.20,2.92,271.20,0.0,515.20,0.0,0,0.0,515.20,1.000000
2655,Albania,2014,7791.43,4724.43,183.45,3250.45,0.0,4724.43,0.0,0,0.0,4724.43,1.000000
2357,Bhutan,2014,2085.46,7003.86,4991.90,187.37,0.0,7003.36,0.0,0,0.0,7003.36,0.999929
1008,Paraguay,2014,13432.00,55282.30,41400.10,0.00,0.0,55276.40,0.0,0,0.0,55276.40,0.999893
1704,Iceland,2014,17475.00,18122.00,0.00,0.00,5238.0,12873.00,0.0,0,8.0,18119.00,0.999834
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4145,State of Palestine,2014,5274.90,336.60,0.00,4938.30,0.0,0.00,0.0,0,0.0,0.00,0.000000
4124,Timor-Leste,2014,349.00,349.00,0.00,0.00,0.0,0.00,0.0,0,0.0,0.00,0.000000
4121,Somalia,2014,350.00,350.00,0.00,0.00,0.0,0.00,0.0,0,0.0,0.00,0.000000
4022,Sint Maarten (Dutch part),2014,417.90,417.90,0.00,0.00,0.0,0.00,0.0,0,0.0,0.00,0.000000


In [147]:
df_countries_2014['exports'].apply(assign_label)

3924               Less than 500
2655               Less than 500
2357       Between 500 and 5,000
1008    Between 5,000 and 50,000
1704               Less than 500
                  ...           
4145               Less than 500
4124               Less than 500
4121               Less than 500
4022               Less than 500
4020               Less than 500
Name: exports, Length: 229, dtype: object
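
For this kind of range labelling, `pd.cut` is a common vectorized alternative to `apply`. The sketch below is one way to reproduce `assign_label` with it, not the method the notebook uses. The sample values are illustrative, and `right=False` is set so that each bin includes its lower edge, matching the `<` comparisons in the function.

```python
import pandas as pd

def assign_label(value):
    # Same function as in the cell above, repeated so the sketch is self-contained
    if value < 500:
        return "Less than 500"
    elif value < 5000:
        return "Between 500 and 5,000"
    elif value < 50000:
        return "Between 5,000 and 50,000"
    return "50,000 or more"

# Illustrative export values, including the boundary value 500
exports = pd.Series([2.92, 183.45, 500.0, 4991.9, 41400.1, 60000.0], name="exports")

bins = [float("-inf"), 500, 5000, 50000, float("inf")]
labels = [
    "Less than 500",
    "Between 500 and 5,000",
    "Between 5,000 and 50,000",
    "50,000 or more",
]

# right=False makes the intervals [low, high), like the "<" tests above
binned = pd.cut(exports, bins=bins, labels=labels, right=False)
applied = exports.apply(assign_label)
print(binned)
```

The result of `pd.cut` is a categorical `Series`, so the labels keep their order, which can be handy for later sorting or grouping.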

In [149]:
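def sum_may_mil(value):
    # value is one column (a Series): keep only the entries above 1000 and add them up
    filtered_column = value[value > 1000]
    my_sum = filtered_column.sum()
    return my_sum

# axis='index' passes each column to the function in turn
df_countries[["hydro","wind","solar","geothermal","tide"]].apply(sum_may_mil, axis='index')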


hydro         7.245072e+07
wind          3.721053e+06
solar         5.262870e+05
geothermal    1.291233e+06
tide          0.000000e+00
dtype: float64
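
To show how the `axis` argument changes what `sum_may_mil` receives, the sketch below runs it per column and per row on a small toy frame. It also compares the per-column result with a vectorized expression, `df.where(df > 1000).sum()`, which is offered as a possible equivalent rather than as part of the original analysis.

```python
import pandas as pd

def sum_may_mil(value):
    # Sum only the entries greater than 1000 in the Series that is passed in
    filtered_column = value[value > 1000]
    return filtered_column.sum()

# Toy frame with two of the renewable columns
df = pd.DataFrame({
    "hydro": [764.0, 5762.8, 1064337.0],
    "wind": [0.0, 0.0, 156078.0],
})

per_column = df.apply(sum_may_mil, axis="index")    # one result per column
per_row = df.apply(sum_may_mil, axis="columns")     # one result per row

# Vectorized form: mask values <= 1000 as NaN, then sum (NaN is skipped)
vectorized = df.where(df > 1000).sum()
print(per_column)
print(per_row)
```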