## HackerRank SQL Problems
solved in both SQL and pandas

## 1. Revising the Select Query I

Query all columns for all American cities in **CITY** with populations larger than 100000. The CountryCode for America is USA.

**Input Format**

The **CITY** table is described as follows: ![Problem1Table](https://s3.amazonaws.com/hr-challenge-images/8137/1449729804-f21d187d0f-CITY.jpg)

mysql solution
```mysql
select * 
from CITY
where countrycode='USA' and population>100000
```

pandas
```python
import pandas as pd
df=pd.read_csv('table.csv')
df.loc[(df['population']>100000)&(df['countrycode']=='USA')]
```

## 2. Revising the Select Query II

Query the names of all American cities in **CITY** with populations larger than 120000. The CountryCode for America is USA.

**Input Format**

The **CITY** table is described as follows: ![Problem1Table](https://s3.amazonaws.com/hr-challenge-images/8137/1449729804-f21d187d0f-CITY.jpg)

mysql solution
```mysql
select distinct name
from CITY
where countrycode='USA' and population>120000
```

pandas
```python
import pandas as pd
df=pd.read_csv('table.csv')
df.loc[(df['population']>120000)&(df['countrycode']=='USA'),'name'].unique()
```

## 3. Select By ID

Query all columns for a city in CITY with the ID 1661.

**Input Format**

The **CITY** table is described as follows: ![Problem1Table](https://s3.amazonaws.com/hr-challenge-images/8137/1449729804-f21d187d0f-CITY.jpg)

mysql solution
```mysql
select *
from CITY
where id=1661
```

pandas solution
```python
import pandas as pd
df=pd.read_csv('table.csv',index_col='id')
df.loc[1661]
```

## 4. Japanese Cities' Attributes

Query all attributes of every Japanese city in the CITY table. The COUNTRYCODE for Japan is JPN

**Input Format**

The **CITY** table is described as follows: ![Problem1Table](https://s3.amazonaws.com/hr-challenge-images/8137/1449729804-f21d187d0f-CITY.jpg)

mysql solution
```mysql
select *
from CITY
where countrycode='JPN'
```

pandas
```python
import pandas as pd
df=pd.read_csv('table.csv')
df.loc[(df['countrycode']=='JPN')]
```

## 5. Japanese Cities' Names

Query the names of all the Japanese cities in the CITY table. The COUNTRYCODE for Japan is JPN.

**Input Format**

The **CITY** table is described as follows: ![Problem1Table](https://s3.amazonaws.com/hr-challenge-images/8137/1449729804-f21d187d0f-CITY.jpg)

mysql solution
```mysql
select distinct name
from CITY
where countrycode='JPN'
```

pandas
```python
import pandas as pd
df=pd.read_csv('table.csv')
df.loc[(df['countrycode']=='JPN'),'name'].unique()
```

## 6. Weather Observation Station 1

Query a list of CITY and STATE from the **STATION** table.

**Input Format**

The **STATION** table is described as follows: ![StationTable](https://s3.amazonaws.com/hr-challenge-images/9336/1449345840-5f0a551030-Station.jpg)
where LAT_N is the northern latitude and LONG_W is the western longitude.

mysql
```mysql
select city,state
from station
```
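
pandas

A possible pandas equivalent, as a minimal sketch. The inline rows are invented stand-ins for `table.csv`, and the lowercase column names are an assumption carried over from the other pandas solutions.
```python
import pandas as pd

# hypothetical sample rows standing in for the STATION table
df = pd.DataFrame({
    'id': [1, 2, 3],
    'city': ['Kissee Mills', 'Loma Mar', 'Sandy Hook'],
    'state': ['MO', 'CA', 'CT'],
    'lat_n': [39.0, 48.0, 72.0],
    'long_w': [73.0, 130.0, 148.0],
})

# keep only the city and state columns, like `select city,state`
result = df[['city', 'state']]
print(result)
```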

## 7. Weather Observation Station 3

Query a list of CITY names from STATION with even ID numbers only. You may print the results in any order, but must exclude duplicates from your answer.

**Input Format**

The **STATION** table is described as follows: ![StationTable](https://s3.amazonaws.com/hr-challenge-images/9336/1449345840-5f0a551030-Station.jpg)
where LAT_N is the northern latitude and LONG_W is the western longitude.

mysql
```mysql
select distinct city
from station
where id%2=0
```
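
pandas

A possible pandas equivalent, as a minimal sketch. The inline rows are invented for illustration, and `id` is kept as an ordinary column (an assumption) so that the even-ID filter reads like the SQL `where` clause.
```python
import pandas as pd

# hypothetical sample rows standing in for the STATION table
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 6],
    'city': ['Amo', 'Roy', 'Pine', 'Roy', 'Amo'],
})

# keep rows with an even id, then drop duplicate city names
even_cities = df.loc[df['id'] % 2 == 0, 'city'].unique()
print(even_cities)
```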

## 7. Weather Observation Station 4

Let N be the number of CITY entries in STATION, and let N' be the number of distinct CITY names in STATION; query the value of N-N' from STATION. In other words, find the difference between the total number of CITY entries in the table and the number of distinct CITY entries in the table.

**Input Format**

The **STATION** table is described as follows: ![StationTable](https://s3.amazonaws.com/hr-challenge-images/9336/1449345840-5f0a551030-Station.jpg)
where LAT_N is the northern latitude and LONG_W is the western longitude.

mysql
```mysql
select count(city)-count(distinct city)
from station
```

pandas
```python
import pandas as pd
df=pd.read_csv('table.csv',index_col='id')
# count() skips nulls, matching count(city) in SQL
df['city'].count()-df['city'].nunique()
```

## 8. Weather Observation Station 5

Query the two cities in STATION with the shortest and longest CITY names, as well as their respective lengths (i.e.: number of characters in the name). If there is more than one smallest or largest city, choose the one that comes first when ordered alphabetically.

**Input Format**

The **STATION** table is described as follows: ![StationTable](https://s3.amazonaws.com/hr-challenge-images/9336/1449345840-5f0a551030-Station.jpg)
where LAT_N is the northern latitude and LONG_W is the western longitude.

mysql
```mysql
(select city, char_length(city) as length from station order by char_length(city),city asc limit 1)
union all
(select city, char_length(city) as length from station order by char_length(city) desc,city asc limit 1)
```

pandas
```python
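import pandas as pd
df=pd.read_csv('table.csv',index_col='id')
df['namelength']=df['city'].str.len()
# shortest name first; ties broken alphabetically
shortest=df.sort_values(['namelength','city'],ascending=[True,True]).iloc[[0]]
# longest name first; ties broken alphabetically
longest=df.sort_values(['namelength','city'],ascending=[False,True]).iloc[[0]]
pd.concat([shortest,longest])[['city','namelength']]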
```

## 9. Weather Observation Station 6

Query the list of CITY names starting with vowels (i.e., a, e, i, o, or u) from STATION. Your result cannot contain duplicates.

**Input Format**

The **STATION** table is described as follows: ![StationTable](https://s3.amazonaws.com/hr-challenge-images/9336/1449345840-5f0a551030-Station.jpg)
where LAT_N is the northern latitude and LONG_W is the western longitude.

mysql
```mysql
select distinct city
from station
where left(city,1) in ('a','e','i','o','u')
```

pandas
```python
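import pandas as pd
df=pd.read_csv('table.csv',index_col='id')
# lowercase the first letter so capitalized city names also match
df.loc[df['city'].str[0].str.lower().isin(['a','e','i','o','u']),'city'].unique()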
```

## 10. Weather Observation Station 7

Query the list of CITY names ending with vowels (a, e, i, o, u) from STATION. Your result cannot contain duplicates.

**Input Format**

The **STATION** table is described as follows: ![StationTable](https://s3.amazonaws.com/hr-challenge-images/9336/1449345840-5f0a551030-Station.jpg)
where LAT_N is the northern latitude and LONG_W is the western longitude.

mysql
```mysql
select distinct city
from station
where right(city,1) in ('a','e','i','o','u')
```

pandas
```python
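import pandas as pd
df=pd.read_csv('table.csv',index_col='id')
# lowercase the last letter so the comparison is case-insensitive like MySQL's
df.loc[df['city'].str[-1].str.lower().isin(['a','e','i','o','u']),'city'].unique()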
```

## 11. Weather Observation Station 8

Query the list of CITY names from STATION which have vowels (i.e., a, e, i, o, and u) as both their first and last characters. Your result cannot contain duplicates.

**Input Format**

The **STATION** table is described as follows: ![StationTable](https://s3.amazonaws.com/hr-challenge-images/9336/1449345840-5f0a551030-Station.jpg)
where LAT_N is the northern latitude and LONG_W is the western longitude.

mysql
```mysql
select distinct city
from station
where right(city,1) in ('a','e','i','o','u') and left(city,1) in ('a','e','i','o','u')
```

pandas
```python
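import pandas as pd
df=pd.read_csv('table.csv',index_col='id')
# lowercase before comparing so capitalized city names also match
startswithvowel=df['city'].str[0].str.lower().isin(['a','e','i','o','u'])
endswithvowel=df['city'].str[-1].str.lower().isin(['a','e','i','o','u'])
df.loc[(startswithvowel)&(endswithvowel),'city'].unique()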
```

## 12. Weather Observation Station 9

Query the list of CITY names from STATION that do not start with vowels. Your result cannot contain duplicates.

**Input Format**

The **STATION** table is described as follows: ![StationTable](https://s3.amazonaws.com/hr-challenge-images/9336/1449345840-5f0a551030-Station.jpg)
where LAT_N is the northern latitude and LONG_W is the western longitude.

mysql
```mysql
select distinct city
from station
where left(city,1) not in ('a','e','i','o','u')
```

pandas
```python
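import pandas as pd
df=pd.read_csv('table.csv',index_col='id')
# lowercase before comparing so capitalized city names also match
startswithvowel=df['city'].str[0].str.lower().isin(['a','e','i','o','u'])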
df.loc[(~startswithvowel),'city'].unique()
```

## 13. Weather Observation Station 10

Query the list of CITY names from STATION that do not end with vowels. Your result cannot contain duplicates.

**Input Format**

The **STATION** table is described as follows: ![StationTable](https://s3.amazonaws.com/hr-challenge-images/9336/1449345840-5f0a551030-Station.jpg)
where LAT_N is the northern latitude and LONG_W is the western longitude.

mysql
```mysql
select distinct city
from station
where right(city,1) not in ('a','e','i','o','u')
```

pandas
```python
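import pandas as pd
df=pd.read_csv('table.csv',index_col='id')
# lowercase before comparing so the check is case-insensitive like MySQL's
endswithvowel=df['city'].str[-1].str.lower().isin(['a','e','i','o','u'])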
df.loc[(~endswithvowel),'city'].unique()
```

## 14. Weather Observation Station 11

Query the list of CITY names from STATION that either do not start with vowels or do not end with vowels. Your result cannot contain duplicates.

**Input Format**

The **STATION** table is described as follows: ![StationTable](https://s3.amazonaws.com/hr-challenge-images/9336/1449345840-5f0a551030-Station.jpg)
where LAT_N is the northern latitude and LONG_W is the western longitude.

mysql
```mysql
select distinct city
from station
where right(city,1) not in ('a','e','i','o','u') or left(city,1) not in ('a','e','i','o','u')
```

pandas
```python
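import pandas as pd
df=pd.read_csv('table.csv',index_col='id')
# lowercase before comparing so capitalized city names also match
startswithvowel=df['city'].str[0].str.lower().isin(['a','e','i','o','u'])
endswithvowel=df['city'].str[-1].str.lower().isin(['a','e','i','o','u'])
df.loc[(~startswithvowel)|(~endswithvowel),'city'].unique()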
```

## 15. Weather Observation Station 12

Query the list of CITY names from STATION that do not start with vowels and do not end with vowels. Your result cannot contain duplicates.

**Input Format**

The **STATION** table is described as follows: ![StationTable](https://s3.amazonaws.com/hr-challenge-images/9336/1449345840-5f0a551030-Station.jpg)
where LAT_N is the northern latitude and LONG_W is the western longitude.

mysql
```mysql
select distinct city
from station
where right(city,1) not in ('a','e','i','o','u') and left(city,1) not in ('a','e','i','o','u')
```

pandas
```python
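import pandas as pd
df=pd.read_csv('table.csv',index_col='id')
# lowercase before comparing so capitalized city names also match
startswithvowel=df['city'].str[0].str.lower().isin(['a','e','i','o','u'])
endswithvowel=df['city'].str[-1].str.lower().isin(['a','e','i','o','u'])
df.loc[(~startswithvowel)&(~endswithvowel),'city'].unique()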
```

## 16. Higher Than 75 Marks

Query the Name of any student in STUDENTS who scored higher than 75 Marks. Order your output by the last three characters of each name. If two or more students both have names ending in the same last three characters (i.e.: Bobby, Robby, etc.), secondary sort them by ascending ID.

**Input Format**

The STUDENTS table is described as follows: ![StudentsTable](https://s3.amazonaws.com/hr-challenge-images/12896/1443815243-94b941f556-1.png)

The Name column only contains uppercase (A-Z) and lowercase (a-z) letters.

**Sample Input**
![SampleInput](https://s3.amazonaws.com/hr-challenge-images/12896/1443815209-cf4b260993-2.png)

mysql
```mysql
select Name
from students
where Marks>75
order by right(Name,3),id
```

pandas
```python
import pandas as pd
df=pd.read_csv('table.csv',index_col='id')
df['substring']=df['Name'].str[-3:]
df.index.rename('indexname',inplace=True)
df=df.sort_values(['substring','indexname'])
df.loc[(df['Marks']>75),'Name']
```

## 17. Employee Names

Write a query that prints a list of employee names (i.e.: the name attribute) from the Employee table in alphabetical order.

**Input Format**

The Employee table containing employee data for a company is described as follows: ![EmployeeTable](https://s3.amazonaws.com/hr-challenge-images/19629/1458557872-4396838885-ScreenShot2016-03-21at4.27.13PM.png)

where employee_id is an employee's ID number, name is their name, months is the total number of months they've been working for the company, and salary is their monthly salary.

**Sample Input**
![SampleInput](https://s3.amazonaws.com/hr-challenge-images/19629/1458558202-9a8721e44b-ScreenShot2016-03-21at4.32.59PM.png)

**Sample Output**
```
Angela
Bonnie
Frank
Joe
Kimberly
Lisa
Michael
Patrick
Rose
Todd
```

mysql
```mysql
select name
from employee order by name asc
```
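
pandas

A possible pandas equivalent, as a minimal sketch. The three inline rows are invented stand-ins for the Employee table, with column names taken from the table description above.
```python
import pandas as pd

# hypothetical sample rows standing in for the Employee table
df = pd.DataFrame({
    'employee_id': [12228, 33645, 45692],
    'name': ['Rose', 'Angela', 'Frank'],
    'months': [15, 1, 17],
    'salary': [1968, 3443, 1608],
})

# sort the name column alphabetically, like `order by name asc`
names = df['name'].sort_values()
print(names.tolist())
```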

## 18. Employee Salaries

Write a query that prints a list of employee names (i.e.: the name attribute) for employees in Employee having a salary greater than $2000 per month who have been employees for less than 10 months. Sort your result by ascending employee_id.

**Input Format**

The Employee table containing employee data for a company is described as follows: ![EmployeeTable](https://s3.amazonaws.com/hr-challenge-images/19629/1458557872-4396838885-ScreenShot2016-03-21at4.27.13PM.png)

where employee_id is an employee's ID number, name is their name, months is the total number of months they've been working for the company, and salary is their monthly salary.

**Sample Input**
![SampleInput](https://s3.amazonaws.com/hr-challenge-images/19629/1458558202-9a8721e44b-ScreenShot2016-03-21at4.32.59PM.png)

**Sample Output**
```
Angela
Michael
Todd
Joe
```

mysql
```mysql
select name
from employee 
where salary>2000 and months<10
order by employee_id asc
```

pandas
```python
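import pandas as pd
df=pd.read_csv('table.csv',index_col='employee_id')
# filter, then sort by ascending employee_id (the index)
df.loc[(df['salary']>2000)&(df['months']<10),'name'].sort_index()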
```

## 19. The Report

You are given two tables: Students and Grades. Students contains three columns ID, Name and Marks.

![Students](https://s3.amazonaws.com/hr-challenge-images/12891/1443818166-a5c852caa0-1.png)

Grades contains the following data:

![Grades](https://s3.amazonaws.com/hr-challenge-images/12891/1443818137-69b76d805c-2.png)

Ketty gives Eve a task to generate a report containing three columns: Name, Grade and Mark. Ketty doesn't want the NAMES of those students who received a grade lower than 8. The report must be in descending order by grade -- i.e. higher grades are entered first. If there is more than one student with the same grade (8-10) assigned to them, order those particular students by their name alphabetically. Finally, if the grade is lower than 8, use "NULL" as their name and list them by their grades in descending order. If there is more than one student with the same grade (1-7) assigned to them, order those particular students by their marks in ascending order.

Write a query to help Eve.

**Sample Input**
![SampleInput](https://s3.amazonaws.com/hr-challenge-images/12891/1443818093-b79f376ec1-3.png)

**Sample Output**
```
Maria 10 99
Jane 9 81
Julia 9 88 
Scarlet 8 78
NULL 7 63
NULL 7 68
```

mysql
```mysql
select case when g.grade>=8 then s.name
            when g.grade<8 then null end as name,
       g.grade,
       s.marks
from students as s
join grades as g on s.marks>=g.min_mark and s.marks<=g.max_mark 
order by g.grade desc,name asc,s.marks asc
```

pandas
```python
import pandas as pd
s=pd.read_csv('students.csv',index_col='id')
g=pd.read_csv('grades.csv')

# use math to create grade column in s (cheating): 0-9 -> 1, ..., 90-100 -> 10
s['grade']=(s['marks']//10+1).clip(upper=10)

# OR use an IntervalIndex built from the min_mark and max_mark columns
g.index=pd.IntervalIndex.from_arrays(g['min_mark'],g['max_mark'],closed='both')
# using pd.IntervalIndex.get_loc()
s['grade']=s['marks'].apply(lambda x: g.iloc[g.index.get_loc(x)]['grade'])
# using pd.IntervalIndex.get_indexer() (to_numpy avoids index misalignment on assignment)
s['grade']=g['grade'].to_numpy()[g.index.get_indexer(s['marks'])]
# using lambda function and boolean array with dot product
s['grade']=s['marks'].apply(lambda x: ((x>=g['min_mark'])&(x<=g['max_mark'])).dot(g['grade']))

# replace names for grades < 8 with null
s['outputname']=s['name']
s.loc[s['grade']<8,'outputname']=None
# sort by grade descending, then name ascending, then marks ascending
s.sort_values(['grade','outputname','marks'],ascending=[False,True,True],inplace=True)
s[['outputname','grade','marks']]
```

## 20. Top Competitors

Julia just finished conducting a coding contest, and she needs your help assembling the leaderboard! Write a query to print the respective hacker_id and name of hackers who achieved full scores for more than one challenge. Order your output in descending order by the total number of challenges in which the hacker earned a full score. If more than one hacker received full scores in same number of challenges, then sort them by ascending hacker_id.

**Input Format**

The following tables contain contest data:

* Hackers: The hacker_id is the id of the hacker, and name is the name of the hacker. 
![Hackers](https://s3.amazonaws.com/hr-challenge-images/19504/1458526776-67667350b4-ScreenShot2016-03-21at7.45.59AM.png)
* Difficulty: The difficulty_level is the level of difficulty of the challenge, and score is the score of the challenge for the difficulty level.
![Difficulty](https://s3.amazonaws.com/hr-challenge-images/19504/1458526915-57eb75d9a2-ScreenShot2016-03-21at7.46.09AM.png)
* Challenges: The challenge_id is the id of the challenge, the hacker_id is the id of the hacker who created the challenge, and difficulty_level is the level of difficulty of the challenge.
![Challenges](https://s3.amazonaws.com/hr-challenge-images/19504/1458527032-f9ca650442-ScreenShot2016-03-21at7.46.17AM.png)
* Submissions: The submission_id is the id of the submission, hacker_id is the id of the hacker who made the submission, challenge_id is the id of the challenge that the submission belongs to, and score is the score of the submission. 
![Submissions](https://s3.amazonaws.com/hr-challenge-images/19504/1458527077-298f8e922a-ScreenShot2016-03-21at7.46.29AM.png)

**Sample Input**
Hackers Table:
![HackersTable](https://s3.amazonaws.com/hr-challenge-images/19504/1458527241-6922b4ad87-ScreenShot2016-03-21at7.47.02AM.png)
Difficulty Table:
![DifficultyTable](https://s3.amazonaws.com/hr-challenge-images/19504/1458527265-7ad6852a13-ScreenShot2016-03-21at7.46.50AM.png)
Challenges Table:
![ChallengesTable](https://s3.amazonaws.com/hr-challenge-images/19504/1458527285-01e95eb6ec-ScreenShot2016-03-21at7.46.40AM.png)
Submissions Table:
![SubmissionsTable](https://s3.amazonaws.com/hr-challenge-images/19504/1458527812-479a74b99f-ScreenShot2016-03-21at8.06.05AM.png)

**Sample Output**
```
90411 Joe
```

**Explanation**

Hacker 86870 got a score of 30 for challenge 71055 with a difficulty level of 2, so 86870 earned a full score for this challenge.

Hacker 90411 got a score of 30 for challenge 71055 with a difficulty level of 2, so 90411 earned a full score for this challenge.

Hacker 90411 got a score of 100 for challenge 66730 with a difficulty level of 6, so 90411 earned a full score for this challenge.

Only hacker 90411 managed to earn a full score for more than one challenge, so we print their hacker_id and name as space-separated values.

mysql
```mysql
select hacker_id,name
from (
    select h.hacker_id,h.name,count(s.challenge_id) as numchallenges
    from hackers as h
    join submissions as s on s.hacker_id=h.hacker_id
    join challenges as c on c.challenge_id=s.challenge_id
    join difficulty as d on d.difficulty_level=c.difficulty_level and d.score=s.score
    group by h.hacker_id,h.name
) as iq
where numchallenges>1
order by numchallenges desc, hacker_id asc
```

pandas
```python
import pandas as pd
h=pd.read_csv('hackers.csv')
s=pd.read_csv('submissions.csv')
# drop the challenge creator's hacker_id so it cannot clash with the submitter's
c=pd.read_csv('challenges.csv')[['challenge_id','difficulty_level']]
d=pd.read_csv('difficulty.csv')

# using join 3 times
df=(h.join(s.set_index('hacker_id'),on='hacker_id',how='inner')
     .join(c.set_index('challenge_id'),on='challenge_id',how='inner')
     .join(d.set_index(['difficulty_level','score']),on=['difficulty_level','score'],how='inner'))

# OR using 3 merge statements
df=pd.merge(left=h,right=s,on='hacker_id',how='inner')
df=pd.merge(left=df,right=c,on='challenge_id',how='inner')
df=pd.merge(left=df,right=d,on=['difficulty_level','score'],how='inner')

# count full-score submissions per hacker and keep hackers with more than one
groups=df.groupby(['hacker_id','name'])
result=groups['challenge_id'].count().reset_index(name='numchallenges')
result=result[result['numchallenges']>1]
result.sort_values(['numchallenges','hacker_id'],ascending=[False,True])[['hacker_id','name']]
```

## 21. Ollivander's Inventory


Harry Potter and his friends are at Ollivander's with Ron, finally replacing Charlie's old broken wand.
Hermione decides the best way to choose is by determining the minimum number of gold galleons needed to buy each non-evil wand of high power and age. Write a query to print the id, age, coins_needed, and power of the wands that Ron's interested in, sorted in order of descending power. If more than one wand has same power, sort the result in order of descending age.



**Input Format**

The following tables contain data on the wands in Ollivander's inventory:

* Wands: The id is the id of the wand, code is the code of the wand, coins_needed is the total number of gold galleons needed to buy the wand, and power denotes the quality of the wand (the higher the power, the better the wand is). 
![Wands](https://s3.amazonaws.com/hr-challenge-images/19502/1458538092-b2a8163a74-ScreenShot2016-03-08at12.13.39AM.png)
* Wands_Property: The code is the code of the wand, age is the age of the wand, and is_evil denotes whether the wand is good for the dark arts. If the value of is_evil is 0, it means that the wand is not evil. The mapping between code and age is one-one, meaning that if there are two pairs, (code1,age1) and (code2,age2), then code1!=code2 and age1!=age2.
![Wands_Property](https://s3.amazonaws.com/hr-challenge-images/19502/1458538221-18c4092b7d-ScreenShot2016-03-08at12.13.53AM.png)


**Sample Input**
Wands Table:
![WandsTable](https://s3.amazonaws.com/hr-challenge-images/19502/1458538559-51bf29644e-ScreenShot2016-03-21at10.34.41AM.png)
Wands_Property Table:
![Wands_PropertyTable](https://s3.amazonaws.com/hr-challenge-images/19502/1458538583-fd514566f9-ScreenShot2016-03-21at10.34.28AM.png)

**Sample Output**
```
9 45 1647 10
12 17 9897 10
1 20 3688 8
15 40 6018 7
19 20 7651 6
11 40 7587 5
10 20 504 5
18 40 3312 3
20 17 5689 3
5 45 6020 2
14 40 5408 1
```

mysql
```mysql
select w2.id,iq.age,w2.coins_needed,w2.power
from (
select w.code,wp.age,w.power,min(w.coins_needed) as min_cost
from wands as w
join wands_property as wp on w.code=wp.code
where wp.is_evil=0
group by w.code,wp.age,w.power) as iq
join wands as w2 on w2.power=iq.power and w2.coins_needed=iq.min_cost and w2.code=iq.code
order by w2.power desc, iq.age desc
```

pandas
```python
import pandas as pd
w=pd.read_csv('wands.csv')
p=pd.read_csv('wands_property.csv')

df=pd.merge(left=w,right=p,on='code',how='inner')
# keep non-evil wands only, like `where wp.is_evil=0`
df=df[df['is_evil']==0]
groups=df.groupby(['code','power'])
result=groups['coins_needed'].min().reset_index()
result=pd.merge(left=result,right=w,on=['code','power','coins_needed'],how='left')
result=pd.merge(left=result,right=p,on='code',how='left')
result.sort_values(['power','age'],ascending=False)[['id','age','coins_needed','power']]
```

## 22. Challenges

Julia asked her students to create some coding challenges. Write a query to print the hacker_id, name, and the total number of challenges created by each student. Sort your results by the total number of challenges in descending order. If more than one student created the same number of challenges, then sort the result by hacker_id. If more than one student created the same number of challenges and the count is less than the maximum number of challenges created, then exclude those students from the result.

**Input Format**

The following tables contain challenge data:

* Hackers: The hacker_id is the id of the hacker, and name is the name of the hacker. 
![Hackers](https://s3.amazonaws.com/hr-challenge-images/19504/1458526776-67667350b4-ScreenShot2016-03-21at7.45.59AM.png)
* Challenges: The challenge_id is the id of the challenge, the hacker_id is the id of the hacker who created the challenge, and difficulty_level is the level of difficulty of the challenge.
![Challenges](https://s3.amazonaws.com/hr-challenge-images/19506/1458521079-549341d9ec-ScreenShot2016-03-21at6.07.03AM.png)
---
**Sample Input 0**

Hackers Table:
![HackersTable](https://s3.amazonaws.com/hr-challenge-images/19506/1458521384-34c6866dae-ScreenShot2016-03-21at6.07.15AM.png)
Challenges Table:
![ChallengesTable](https://s3.amazonaws.com/hr-challenge-images/19506/1458521410-befa8e1cd9-ScreenShot2016-03-21at6.07.25AM.png)

**Sample Output 0**
```
21283 Angela 6
88255 Patrick 5
96196 Lisa 1
```

**Sample Input 1**

Hackers Table:
![HackersTable](https://s3.amazonaws.com/hr-challenge-images/19506/1458521469-87036deea3-ScreenShot2016-03-21at6.07.48AM.png)
Challenges Table:
![ChallengesTable](https://s3.amazonaws.com/hr-challenge-images/19506/1458521490-358215cf0b-ScreenShot2016-03-21at6.07.58AM.png)

**Sample Output 1**
```
12299 Rose 6
34856 Angela 6
79345 Frank 4
80491 Patrick 3
81041 Lisa 1
```


**Explanation**

For Sample Case 0, we can get the following details: 
![Explanation0](https://s3.amazonaws.com/hr-challenge-images/19506/1458521677-fd04c384c0-ScreenShot2016-03-21at6.07.38AM.png)

Students 5077 and 62743 both created the same number of challenges, and that number is less than the maximum of 6 challenges created, so these students are excluded from the result.

For Sample Case 1, we can get the following details: 
![Explanation1](https://s3.amazonaws.com/hr-challenge-images/19506/1458521836-24039e7523-ScreenShot2016-03-21at6.08.08AM.png)

Students 12299 and 34856 both created 6 challenges. Because 6 is the maximum number of challenges created, these students are included in the result.

mysql
```mysql
select h.hacker_id,h.name, count(c.challenge_id) as ccount
from
hackers as h
join challenges as c on h.hacker_id=c.hacker_id
group by h.hacker_id,h.name
having ccount= (select max(t1.ccount) from (
                    select count(challenge_id) as ccount from challenges group by hacker_id 
                    ) as t1
)
or ccount in (
    select t2.ccount from (
        select count(challenge_id) as ccount from challenges group by hacker_id
    ) as t2
    group by t2.ccount having count(t2.ccount)=1
)
order by ccount desc, h.hacker_id asc

```

pandas
```python
import pandas as pd
h=pd.read_csv('hackers.csv')
c=pd.read_csv('challenges.csv')

# merge and group
df=pd.merge(left=h,right=c,on='hacker_id',how='inner')
groups=df.groupby(['hacker_id','name'])

# count number of challenges and sort
result=groups['challenge_id'].count().reset_index()
result.sort_values(['challenge_id','hacker_id'],ascending=[False,True],inplace=True)

# get max challenge count
max_count=result['challenge_id'].max()
# get the challenge counts that belong to exactly one hacker with value_counts
ccounts=result['challenge_id'].value_counts()
singles=ccounts[ccounts==1].index
# OR drop every challenge count that appears more than once
singles=result['challenge_id'].drop_duplicates(keep=False)
# OR get boolean array mask of rows with only one value of challenge count
singles2=result.groupby('challenge_id')['challenge_id'].transform('size')==1

# filter results: keep the max count, or any count held by a single hacker
result.loc[(result['challenge_id']==max_count)|(result['challenge_id'].isin(singles))]
# OR filter using the boolean mask
result.loc[(result['challenge_id']==max_count)|(singles2)]
```

## 23. Contest Leaderboard

You did such a great job helping Julia with her last coding contest challenge that she wants you to work on this one, too!

The total score of a hacker is the sum of their maximum scores for all of the challenges. Write a query to print the hacker_id, name, and total score of the hackers ordered by the descending score. If more than one hacker achieved the same total score, then sort the result by ascending hacker_id. Exclude all hackers with a total score of 0 from your result.

**Input Format**

The following tables contain contest data:

* Hackers: The hacker_id is the id of the hacker, and name is the name of the hacker. 
![Hackers](https://s3.amazonaws.com/hr-challenge-images/19503/1458522826-a9ddd28469-ScreenShot2016-03-21at6.40.27AM.png)
* Submissions: The submission_id is the id of the submission, hacker_id is the id of the hacker who made the submission, challenge_id is the id of the challenge that the submission belongs to, and score is the score of the submission.
![Submissions](https://s3.amazonaws.com/hr-challenge-images/19503/1458523022-771511df90-ScreenShot2016-03-21at6.40.37AM.png)
---
**Sample Input**

Hackers Table:
![HackersTable](https://s3.amazonaws.com/hr-challenge-images/19503/1458523374-7ecc39010f-ScreenShot2016-03-21at6.51.56AM.png)
Submissions Table:
![SubmissionsTable](https://s3.amazonaws.com/hr-challenge-images/19503/1458523388-0896218137-ScreenShot2016-03-21at6.51.45AM.png)

**Sample Output**
```
4071 Rose 191
74842 Lisa 174
84072 Bonnie 100
4806 Angela 89
26071 Frank 85
80305 Kimberly 67
49438 Patrick 43
```

**Explanation**

Hacker 4071 submitted solutions for challenges 19797 and 49593, so the total score = 95+max(43,96) = 191.

Hacker 74842 submitted solutions for challenges 19797 and 63132, so the total score = max(98,5) + 76 = 174.

Hacker 84072 submitted solutions for challenges 49593 and 63132, so the total score = 100 + 0 = 100.

The total scores for hackers 4806, 26071, 80305, and 49438 can be similarly calculated.

mysql
```mysql
select t1.hacker_id,h.name,sum(t1.maxscore) as totalscore
from(
select hacker_id,challenge_id,max(score) as maxscore
from submissions 
group by hacker_id,challenge_id
) as t1
join hackers as h on h.hacker_id=t1.hacker_id
group by t1.hacker_id,h.name
having totalscore>0
order by totalscore desc, t1.hacker_id asc
```

pandas
```python
import pandas as pd
h=pd.read_csv('hackers.csv')
s=pd.read_csv('submissions.csv')

# merge data
df=pd.merge(left=h,right=s,on='hacker_id',how='inner')

# get max score per hacker and challenge
groups=df.groupby(['hacker_id','name','challenge_id'])
maxscores=groups['score'].max().reset_index()

# calc total scores
totalscores=maxscores.groupby(['hacker_id','name'])['score'].sum().reset_index()

# order result by total score descending, hacker_id ascending
totalscores.sort_values(['score','hacker_id'],ascending=[False,True],inplace=True)

# filter out hackers with totalscore=0
totalscores.loc[totalscores['score']>0]
```

## 24. Projects

You are given a table, Projects, containing three columns: Task_ID, Start_Date and End_Date. It is guaranteed that the difference between the End_Date and the Start_Date is equal to 1 day for each row in the table.

![Projects](https://s3.amazonaws.com/hr-challenge-images/12894/1443819551-639948acc0-1.png)

If the End_Date of the tasks are consecutive, then they are part of the same project. Samantha is interested in finding the total number of different projects completed.

Write a query to output the start and end dates of projects listed by the number of days it took to complete the project in ascending order. If more than one project has the same number of completion days, then order by the start date of the project.

**Sample Input**

![Input](https://s3.amazonaws.com/hr-challenge-images/12894/1443819440-1c40e943a1-2.png)

**Sample Output**
```
2015-10-28 2015-10-29
2015-10-30 2015-10-31
2015-10-13 2015-10-15
2015-10-01 2015-10-04
```

**Explanation**

The example describes the following four projects:

* Project 1: Tasks 1, 2 and 3 are completed on consecutive days, so these are part of the project. Thus start date of project is 2015-10-01 and end date is 2015-10-04, so it took 3 days to complete the project.
* Project 2: Tasks 4 and 5 are completed on consecutive days, so these are part of the project. Thus, the start date of project is 2015-10-13 and end date is 2015-10-15, so it took 2 days to complete the project.
* Project 3: Only task 6 is part of the project. Thus, the start date of project is 2015-10-28 and end date is 2015-10-29, so it took 1 day to complete the project.
* Project 4: Only task 7 is part of the project. Thus, the start date of project is 2015-10-30 and end date is 2015-10-31, so it took 1 day to complete the project.

mysql
```mysql
select s.start_date,min(e.end_date)
from 
(select start_date from projects where start_date not in (select end_date from projects)) as s
join 
(select end_date from projects where end_date not in (select start_date from projects)) as e
on s.start_date<e.end_date
group by s.start_date
order by datediff(min(e.end_date),s.start_date),s.start_date;

-- alternatively (incomplete: only labels each task with a project number; window functions need MySQL 8.0+)

select start_date,
        end_date,
        sum(nc) over (order by end_date rows between unbounded preceding and current row) as pnum
from(
    select start_date, end_date, case when lagdate is null or datediff(end_date,lagdate)>1 then 1
                                      else 0 end as nc
    from(
        select start_date, end_date, lag(end_date) over (order by end_date) as lagdate from projects
    ) as t1
) as t2
```

pandas
```python
import pandas as pd

# load the table from file, or use the sample data generated below
# p=pd.read_csv('projects.csv',parse_dates=['start_date','end_date'])
# generate data
p = pd.DataFrame({'task_id': [1,2,3,4,5,6,7],
                'start_date': pd.to_datetime(['2015-10-01', '2015-10-02', '2015-10-03', '2015-10-13',
                                              '2015-10-14', '2015-10-28', '2015-10-30']),
                  'end_date': pd.to_datetime(['2015-10-02', '2015-10-03', '2015-10-04', '2015-10-14', 
                                              '2015-10-15', '2015-10-29', '2015-10-31'])})

# order tasks by end date, then shift end date column down
p=p.sort_values('end_date')
p['shifted']=p['end_date'].shift(1)
# flag first row of each new project by identifying non-continuous task end dates
# (parentheses are required because | binds more tightly than >)
p['nc']=((p['end_date']-p['shifted']).dt.days>1) | p['shifted'].isnull()
# label rows with project number using cumulative sum on flag column
p['projectnum']=p['nc'].cumsum() 
# group by project number
groups=p.groupby('projectnum')

# create result set indexed on project number
p2=groups['start_date'].first().to_frame()
p2['end_date']=groups['end_date'].last()
p2['pduration']=p2['end_date']-p2['start_date']
# sort by duration, then start date, and display results
p2.sort_values(['pduration','start_date'],ascending=True,inplace=True)
p2[['start_date','end_date']]
```

## 25. Placements

You are given three tables: Students, Friends and Packages. Students contains two columns: ID and Name. Friends contains two columns: ID and Friend_ID (ID of the ONLY best friend). Packages contains two columns: ID and Salary (offered salary in thousands of dollars per month).

![data](https://s3.amazonaws.com/hr-challenge-images/12895/1443820186-2a9b4939a8-1.png)

Write a query to output the names of those students whose best friends got offered a higher salary than them. Names must be ordered by the salary amount offered to the best friends. It is guaranteed that no two students got the same salary offer.

**Sample Input**

![input1](https://s3.amazonaws.com/hr-challenge-images/12895/1443820100-adb691b2f5-2_2.png)
![input2](https://s3.amazonaws.com/hr-challenge-images/12895/1443820079-9bd1e231b1-2_1.png)

**Sample Output**
```
Samantha
Julia
Scarlet
```

**Explanation**

See the following table:

![explanation](https://s3.amazonaws.com/hr-challenge-images/12895/1443819966-c37c146d27-3.png)

Now,

* Samantha's best friend got offered a higher salary than her at 11.55
* Julia's best friend got offered a higher salary than her at 12.12
* Scarlet's best friend got offered a higher salary than her at 15.2
* Ashley's best friend did NOT get offered a higher salary than her

The name output, when ordered by the salary offered to their friends, will be:

* Samantha
* Julia
* Scarlet

mysql
```mysql
select s.name
from students as s
join friends as f on f.id=s.id
join packages as p on p.id=s.id
join packages as p2 on p2.id=f.friend_id
where p2.salary>p.salary
order by p2.salary
```

pandas
```python
import pandas as pd

s=pd.read_csv('students.csv',index_col='id')
f=pd.read_csv('friends.csv',index_col='id')
p=pd.read_csv('packages.csv',index_col='id')

# attach each student's friend id and own salary (aligned on the id index)
df=s.copy()
df['friend_id']=f['friend_id']
df['salary']=p['salary']
# look up the friend's salary: salary_x is the student's, salary_y is the friend's
df=pd.merge(left=df,right=p,left_on='friend_id',right_index=True,how='inner')
df=df.sort_values('salary_y',ascending=True)
df.loc[df['salary_y']>df['salary_x'],'name']
```

## 26. Symmetric Pairs

You are given a table, Functions, containing two columns: X and Y.

![data](https://s3.amazonaws.com/hr-challenge-images/12892/1443818798-51909e977d-1.png)

Two pairs ($X_{1}$, $Y_{1}$) and ($X_{2}$, $Y_{2}$) are said to be symmetric pairs if $X_{1}$ = $Y_{2}$ and $X_{2}$ = $Y_{1}$.

Write a query to output all such symmetric pairs in ascending order by the value of X.

**Sample Input**

![input1](https://s3.amazonaws.com/hr-challenge-images/12892/1443818693-b384c24e35-2.png)

**Sample Output**
```
20 20
20 21
22 23
```

mysql
```mysql
select distinct f1.X,f1.Y
from functions as f1
join functions as f2 on f1.X=f2.Y and f1.Y=f2.X
where f1.X<f2.X
or (f1.X=f1.Y and f1.X in (select X from functions where X=Y group by X having count(*)>1))
order by f1.X
```

pandas
```python

```
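
The pandas block above is left empty in these notes. One possible approach, as a minimal sketch: self-merge the table on mirrored columns, discard rows that matched themselves, and keep one orientation of each pair. The column names `X` and `Y` and the inline rows are assumptions, chosen to reproduce the Sample Output rather than read from a real `functions.csv`.

```python
import pandas as pd

# illustrative rows (assumed), chosen to reproduce the Sample Output
f = pd.DataFrame({'X': [20, 20, 20, 23, 22, 21],
                  'Y': [20, 20, 21, 22, 23, 20]})

# keep a row identifier so a row matched with itself can be discarded
f = f.reset_index().rename(columns={'index': 'row'})

# self-join: pair each (X, Y) with every row that is its mirror image (Y, X)
m = f.merge(f, left_on=['X', 'Y'], right_on=['Y', 'X'], suffixes=('_1', '_2'))

# a row with X == Y mirrors itself; it only counts if a second such row exists
m = m[m['row_1'] != m['row_2']]

# keep one orientation of each pair (X <= Y), then deduplicate and sort
m = m[m['X_1'] <= m['Y_1']]
result = (m[['X_1', 'Y_1']]
          .drop_duplicates()
          .rename(columns={'X_1': 'X', 'Y_1': 'Y'})
          .sort_values(['X', 'Y'])
          .reset_index(drop=True))
print(result)
```

The row identifier plays the role of the `having count(*)>1` subquery in the SQL version: a pair such as (20, 20) survives only when it appears on at least two distinct rows.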

## 27. The Pads

Generate the following two result sets:

1. Query an alphabetically ordered list of all names in OCCUPATIONS, immediately followed by the first letter of each profession as a parenthetical (i.e.: enclosed in parentheses). For example: AnActorName(A), ADoctorName(D), AProfessorName(P), and ASingerName(S).
2. Query the number of occurrences of each occupation in OCCUPATIONS. Sort the occurrences in ascending order, and output them in the following format:
```
There are a total of [occupation_count] [occupation]s.
```
where [occupation_count] is the number of occurrences of an occupation in OCCUPATIONS and [occupation] is the lowercase occupation name. If more than one Occupation has the same [occupation_count], they should be ordered alphabetically.

**Note:** There will be at least two entries in the table for each type of occupation.

**Input Format**

The OCCUPATIONS table is described as follows: 
![input](https://s3.amazonaws.com/hr-challenge-images/12889/1443816414-2a465532e7-1.png)
Occupation will only contain one of the following values: Doctor, Professor, Singer or Actor.

**Sample Input**

An OCCUPATIONS table that contains the following records:

![sampleinput](https://s3.amazonaws.com/hr-challenge-images/12889/1443816608-0b4d01d157-2.png)

**Sample Output**
```
Ashely(P)
Christeen(P)
Jane(A)
Jenny(D)
Julia(A)
Ketty(P)
Maria(A)
Meera(S)
Priya(S)
Samantha(D)
There are a total of 2 doctors.
There are a total of 2 singers.
There are a total of 3 actors.
There are a total of 3 professors.
```

**Explanation**

The results of the first query are formatted to the problem description's specifications. 
The results of the second query are ascendingly ordered first by number of names corresponding to each profession (2<=2<=3<=3), and then alphabetically by profession (doctor<=singer, and actor<=professor).

mysql
```mysql
select concat(name,'(',substring(occupation from 1 for 1),')')
from occupations
order by name;

select concat('There are a total of ',count(*),' ',lower(occupation),'s.')
from occupations
group by occupation
order by count(*),occupation;
```

pandas
```python
import pandas as pd

o=pd.read_csv('occupations.csv')

# output list of names and occupations
o['output']=o['name']+'('+o['occupation'].str[0]+')'
o=o.sort_values('name')
o['output']

# output occupation counts, sorted by count and then alphabetically
o2=o.groupby('occupation').size().reset_index(name='count')
o2=o2.sort_values(['count','occupation'])
o2['output']='There are a total of '+o2['count'].astype(str)+' '+o2['occupation'].str.lower()+'s.'
o2['output']
```

## 28. Occupations

Pivot the Occupation column in OCCUPATIONS so that each Name is sorted alphabetically and displayed underneath its corresponding Occupation. The output column headers should be Doctor, Professor, Singer, and Actor, respectively.

**Note:** Print NULL when there are no more names corresponding to an occupation.

**Input Format**

The OCCUPATIONS table is described as follows:
![input](https://s3.amazonaws.com/hr-challenge-images/12889/1443816414-2a465532e7-1.png)
Occupation will only contain one of the following values: Doctor, Professor, Singer or Actor.

**Sample Input**

An OCCUPATIONS table that contains the following records:

![sampleinput](https://s3.amazonaws.com/hr-challenge-images/12890/1443817648-1b2b8add45-2.png)

**Sample Output**
```
Jenny    Ashley     Meera  Jane
Samantha Christeen  Priya  Julia
NULL     Ketty      NULL   Maria
```

**Explanation**

The first column is an alphabetically ordered list of Doctor names. 

The second column is an alphabetically ordered list of Professor names. 

The third column is an alphabetically ordered list of Singer names. 

The fourth column is an alphabetically ordered list of Actor names. 

The empty cell data for columns with fewer than the maximum number of names per occupation (in this case, the Doctor and Singer columns) are filled with NULL values.

mysql
```mysql
# for version 8.0+
select rn,
max(case when occupation='Doctor' then name else null end) as Doctor,
max(case when occupation='Professor' then name else null end) as Professor,
max(case when occupation='Singer' then name else null end) as Singer,
max(case when occupation='Actor' then name else null end) as Actor
from (
    select *, row_number() over (partition by occupation order by name) as rn from occupations
) as t1
group by rn

# for version 5.7
select 
    max(case when occupation='Doctor' then name else null end) as Doctor,
    max(case when occupation='Professor' then name else null end) as Professor,
    max(case when occupation='Singer' then name else null end) as Singer,
    max(case when occupation='Actor' then name else null end) as Actor
from (
  select *, 
  ( case occupation 
         when @curOccupation
         then @curRow := @curRow + 1 
         else @curRow := 1 and @curOccupation := occupation 
   end
  ) + 1 AS rn
  from occupations, (select @curRow := 0, @curOccupation := '') r
  order by occupation,name
) t1
group by rn

```

pandas
```python

```
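
The pandas block above is left empty in these notes. A minimal sketch of one way to do the pivot: number the names within each occupation (the pandas counterpart of `row_number() over (partition by ...)`), then pivot on that number. The lowercase column names `name` and `occupation` are assumptions, and the inline rows are taken from the Sample Output instead of an `occupations.csv` file.

```python
import pandas as pd

# rows reconstructed from the Sample Output (column names are assumed)
o = pd.DataFrame({
    'name': ['Samantha', 'Julia', 'Maria', 'Meera', 'Ashley',
             'Ketty', 'Christeen', 'Jane', 'Jenny', 'Priya'],
    'occupation': ['Doctor', 'Actor', 'Actor', 'Singer', 'Professor',
                   'Professor', 'Professor', 'Actor', 'Doctor', 'Singer']})

# number the names alphabetically within each occupation
o = o.sort_values(['occupation', 'name'])
o['rn'] = o.groupby('occupation').cumcount() + 1

# one column per occupation; cells with no name become NaN (NULL in the SQL output)
pivoted = o.pivot(index='rn', columns='occupation', values='name')
pivoted = pivoted[['Doctor', 'Professor', 'Singer', 'Actor']]
print(pivoted)
```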

## 29. Binary Tree Nodes

You are given a table, BST, containing two columns: N and P, where N represents the value of a node in a Binary Tree, and P is the parent of N.

![BST](https://s3.amazonaws.com/hr-challenge-images/12888/1443818507-5095ab9853-1.png)

Write a query to find the node type of Binary Tree ordered by the value of the node. Output one of the following for each node:

* Root: If node is root node.
* Leaf: If node is leaf node.
* Inner: If node is neither root nor leaf node.

**Sample Input**

![sampleinput](https://s3.amazonaws.com/hr-challenge-images/12888/1443818467-30644673f6-2.png)

**Sample Output**
```
1 Leaf
2 Inner
3 Leaf
5 Root
6 Leaf
8 Inner
9 Leaf
```

**Explanation**

The Binary Tree below illustrates the sample:

![explanation](https://s3.amazonaws.com/hr-challenge-images/12888/1443773633-f9e6fd314e-simply_sql_bst.png)

mysql
```mysql
# for version 8.0+
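with recursive cte(N,type) as (
    -- cast so the column is wide enough for 'Inner' in the recursive part
    select N, cast('Root' as char(5)) as type
    from BST where P is null
    union all
    -- exclude the root's NULL parent, otherwise NOT IN never evaluates to true
    select b.N, case when b.N not in (select P from BST where P is not null) then 'Leaf' else 'Inner' end as type
    from BST as b
    join cte on b.P=cte.N
)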
)
select N,type from cte order by N;

# for version 5.7
select N, case when P is null then 'Root' 
               when N in (select P from BST) then 'Inner' 
               else 'Leaf' end as type
from BST order by N;

```

pandas
```python

```
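
The pandas block above is left empty in these notes. A minimal sketch that mirrors the 5.7 query: a node is a root if it has no parent, an inner node if it appears as some other node's parent, and a leaf otherwise. The inline rows are an assumption, chosen to be consistent with the Sample Output rather than read from a file.

```python
import numpy as np
import pandas as pd

# illustrative rows (assumed), consistent with the Sample Output
bst = pd.DataFrame({'N': [1, 3, 6, 9, 2, 8, 5],
                    'P': [2, 2, 8, 8, 5, 5, None]})

is_root = bst['P'].isnull()                     # no parent
is_parent = bst['N'].isin(bst['P'].dropna())    # appears as another node's parent

# conditions are checked in order, so the root is labelled before 'Inner' is considered
bst['type'] = np.select([is_root, is_parent], ['Root', 'Inner'], default='Leaf')
result = bst.sort_values('N')[['N', 'type']]
print(result)
```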

## 29. New Companies

Amber's conglomerate corporation just acquired some new companies. Each of the companies follows this hierarchy:

![companies](https://s3.amazonaws.com/hr-challenge-images/19505/1458531031-249df3ae87-ScreenShot2016-03-21at8.59.56AM.png)

Given the table schemas below, write a query to print the company_code, founder name, total number of lead managers, total number of senior managers, total number of managers, and total number of employees. Order your output by ascending company_code.

**Note:**

* The tables may contain duplicate records.
* The company_code is string, so the sorting should not be numeric. For example, if the company_codes are C_1, C_2, and C_10, then the ascending company_codes will be C_1, C_10, and C_2.
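
The string ordering described in the note can be checked with a short sketch. The three codes come from the note itself; the numeric ordering is shown only for contrast and uses a hypothetical key that parses the part after the underscore.

```python
import pandas as pd

codes = pd.Series(['C_1', 'C_2', 'C_10'])

# strings compare character by character, so 'C_10' sorts before 'C_2'
as_strings = codes.sort_values().tolist()

# for contrast only: ordering by the numeric suffix, which the problem does NOT want
as_numbers = sorted(codes, key=lambda c: int(c.split('_')[1]))

print(as_strings)
print(as_numbers)
```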

**Input Format**

The following tables contain company data:

* Company: The company_code is the code of the company and founder is the founder of the company.
![company](https://s3.amazonaws.com/hr-challenge-images/19505/1458531125-deb0a57ae1-ScreenShot2016-03-21at8.50.04AM.png)

* Lead_Manager: The lead_manager_code is the code of the lead manager, and the company_code is the code of the working company. 
![leadmanager](https://s3.amazonaws.com/hr-challenge-images/19505/1458534960-2c6d764e3c-ScreenShot2016-03-21at8.50.12AM.png)

* Senior_Manager: The senior_manager_code is the code of the senior manager, the lead_manager_code is the code of its lead manager, and the company_code is the code of the working company. 
![seniormanager](https://s3.amazonaws.com/hr-challenge-images/19505/1458534973-6548194998-ScreenShot2016-03-21at8.50.21AM.png)

* Manager: The manager_code is the code of the manager, the senior_manager_code is the code of its senior manager, the lead_manager_code is the code of its lead manager, and the company_code is the code of the working company.
![manager](https://s3.amazonaws.com/hr-challenge-images/19505/1458534988-7fc0af46ce-ScreenShot2016-03-21at8.50.29AM.png)

* Employee: The employee_code is the code of the employee, the manager_code is the code of its manager, the senior_manager_code is the code of its senior manager, the lead_manager_code is the code of its lead manager, and the company_code is the code of the working company.
![employee](https://s3.amazonaws.com/hr-challenge-images/19505/1458535002-d47f63cbb4-ScreenShot2016-03-21at8.50.41AM.png)

**Sample Input**

Company Table: 
![companytable](https://s3.amazonaws.com/hr-challenge-images/19505/1458535049-2a207c44b3-ScreenShot2016-03-21at8.50.52AM.png)

Lead_Manager Table: 
![leadmanagertable](https://s3.amazonaws.com/hr-challenge-images/19505/1458535073-919107f639-ScreenShot2016-03-21at8.51.03AM.png)

Senior_Manager Table: 
![seniormanagertable](https://s3.amazonaws.com/hr-challenge-images/19505/1458535111-b1c48335b3-ScreenShot2016-03-21at8.51.15AM.png)

Manager Table: 
![managertable](https://s3.amazonaws.com/hr-challenge-images/19505/1458535122-888f4bf340-ScreenShot2016-03-21at8.51.26AM.png)

Employee Table:
![employeetable](https://s3.amazonaws.com/hr-challenge-images/19505/1458535134-878767e0d9-ScreenShot2016-03-21at8.51.52AM.png)

**Sample Output**
```
C1 Monika 1 2 1 2
C2 Samantha 1 1 2 2
```

**Explanation**

In company C1, the only lead manager is LM1. There are two senior managers, SM1 and SM2, under LM1. There is one manager, M1, under senior manager SM1. There are two employees, E1 and E2, under manager M1.

In company C2, the only lead manager is LM2. There is one senior manager, SM3, under LM2. There are two managers, M2 and M3, under senior manager SM3. There is one employee, E3, under manager M2, and another employee, E4, under manager M3.

mysql
```mysql
select e.company_code,c.founder,
       count(distinct e.lead_manager_code),
       count(distinct e.senior_manager_code),
       count(distinct e.manager_code),
       count(distinct e.employee_code)
from employee as e
join company as c on c.company_code=e.company_code
group by e.company_code,c.founder
order by e.company_code
```

pandas
```python
e=pd.read_csv('employee.csv')
c=pd.read_csv('company.csv')

# join and group data
df=pd.merge(left=e,right=c,on='company_code',how='inner')
groups=df.groupby(['company_code','founder'])

# count unique employees and output sorted results
groups.nunique().reset_index().sort_values('company_code')

```

## 30. The Blunder

Samantha was tasked with calculating the average monthly salaries for all employees in the EMPLOYEES table, but did not realize her keyboard's 0 key was broken until after completing the calculation. She wants your help finding the difference between her miscalculation (using salaries with any zeroes removed), and the actual average salary.

Write a query calculating the amount of error (i.e.: actual-miscalculated average monthly salaries), and round it up to the next integer.

**Input Format**

The EMPLOYEES table is described as follows:
![employees](https://s3.amazonaws.com/hr-challenge-images/12893/1443817108-adc2235c81-1.png)

**Note:** Salary is measured in dollars per month and its value is $<10^{5}$.

**Sample Input**

![sampleinput](https://s3.amazonaws.com/hr-challenge-images/12893/1443817161-299cc6eb7f-2.png)

**Sample Output**
```
2061
```

**Explanation**

The table below shows the salaries without zeroes as they were entered by Samantha:

![explanation](https://s3.amazonaws.com/hr-challenge-images/12893/1443817229-eb00d44a3b-3.png)

Samantha computes an average salary of 98.00. The actual average salary is 2159.00.

The resulting error between the two calculations is 2159.00-98.00=2061.00 which, when rounded to the next integer, is 2061.

mysql
```mysql
select ceiling(avg(salary)-avg(convert(replace(convert(salary,char),'0',''),signed int)))
from employees
```

pandas
```python

```
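
The pandas block above is left empty in these notes. A minimal sketch of the same calculation: convert each salary to text, strip the `0` characters, convert back to an integer, and round the difference of the two averages up. The column name `salary` and the four inline salaries are assumptions; the values were picked so that the two averages match the 2159.00 and 98.00 quoted in the Explanation.

```python
import math
import pandas as pd

# illustrative salaries (assumed), consistent with the averages in the Explanation
e = pd.DataFrame({'salary': [1420, 2006, 2210, 3000]})

# what the broken keyboard would have produced: every '0' digit removed
miscalculated = (e['salary'].astype(str)
                 .str.replace('0', '', regex=False)
                 .astype(int))

# actual average minus miscalculated average, rounded up to the next integer
error = math.ceil(e['salary'].mean() - miscalculated.mean())
print(error)
```

This sketch assumes every salary keeps at least one non-zero digit; a salary made up only of zeroes would leave an empty string and fail the integer conversion.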

## 30. Top Earners

We define an employee's total earnings to be their monthly salary x months worked, and the maximum total earnings to be the maximum total earnings for any employee in the Employee table. Write a query to find the maximum total earnings for all employees as well as the total number of employees who have maximum total earnings. Then print these values as 2 space-separated integers.

**Input Format**

The EMPLOYEE table containing employee data for a company is described as follows:
![employee](https://s3.amazonaws.com/hr-challenge-images/19629/1458557872-4396838885-ScreenShot2016-03-21at4.27.13PM.png)

where employee_id is an employee's ID number, name is their name, months is the total number of months they've been working for the company, and salary is their monthly salary.

**Sample Input**

![sampleinput](https://s3.amazonaws.com/hr-challenge-images/19631/1458559098-23bf583125-ScreenShot2016-03-21at4.32.59PM.png)

**Sample Output**
```
69952 1
```

**Explanation**

The table and earnings data is depicted in the following diagram:

![explanation](https://s3.amazonaws.com/hr-challenge-images/19631/1458559218-9f37585c7a-ScreenShot2016-03-21at4.49.23PM.png)

The maximum earnings value is 69952. The only employee with earnings=69952 is Kimberly, so we print the maximum earnings value (69952) and a count of the number of employees who have earned $69952 (which is 1) as two space-separated values.

mysql
```mysql
select max(months*salary), count(*)
from employee
where months*salary=(select max(months*salary) from employee)
```

pandas
```python
import pandas as pd

e=pd.read_csv('employee.csv')

e['earnings']=e['months']*e['salary']
maxearnings=e['earnings'].max()
# print the two values separated by a single space
print(maxearnings,(e['earnings']==maxearnings).sum())
```

## 30. Weather Observation Station 2

Query the following two values from the STATION table:

1. The sum of all values in LAT_N rounded to a scale of 2 decimal places.
2. The sum of all values in LONG_W rounded to a scale of 2 decimal places.

**Input Format**

The STATION table is described as follows:

![station](https://s3.amazonaws.com/hr-challenge-images/9336/1449345840-5f0a551030-Station.jpg)

where LAT_N is the northern latitude and LONG_W is the western longitude.

**Output Format**
Your results must be in the form:
```
lat lon
```

where lat is the sum of all values in LAT_N and lon is the sum of all values in LONG_W. Both results must be rounded to a scale of 2 decimal places.

mysql
```mysql
select round(sum(LAT_N),2),round(sum(LONG_W),2)
from station
```

pandas
```python
s=pd.read_csv('station.csv')
s[['LAT_N','LONG_W']].sum().round(2)
```

## 31. Weather Observation Station 13

Query the sum of Northern Latitudes (LAT_N) from STATION having values greater than 38.7880 and less than 137.2345. Truncate your answer to 4 decimal places.

**Input Format**

The STATION table is described as follows:

![station](https://s3.amazonaws.com/hr-challenge-images/9336/1449345840-5f0a551030-Station.jpg)

where LAT_N is the northern latitude and LONG_W is the western longitude.

mysql
```mysql
select truncate(sum(LAT_N),4)
from station
where LAT_N>38.7880 and LAT_N<137.2345
```

pandas
```python
s=pd.read_csv('station.csv')
total=s.loc[(s['LAT_N']>38.7880)&(s['LAT_N']<137.2345),'LAT_N'].sum()
int(total*10**4)/10**4 # truncate (rather than round) to 4 decimal places
```

## 32. Weather Observation Station 14

Query the greatest value of the Northern Latitudes (LAT_N) from STATION that is less than 137.2345. Truncate your answer to 4 decimal places.

**Input Format**

The STATION table is described as follows:

![station](https://s3.amazonaws.com/hr-challenge-images/9336/1449345840-5f0a551030-Station.jpg)

where LAT_N is the northern latitude and LONG_W is the western longitude.

mysql
```mysql
select truncate(max(LAT_N),4)
from station
where LAT_N<137.2345
```

pandas
```python
s=pd.read_csv('station.csv')
maxlat=s.loc[(s['LAT_N']<137.2345),'LAT_N'].max()
int(maxlat*10**4)/10**4 # truncate (rather than round) to 4 decimal places
```

## 33. Weather Observation Station 15

Query the Western Longitude (LONG_W) for the largest Northern Latitude (LAT_N) in STATION that is less than 137.2345. Round your answer to 4 decimal places.

**Input Format**

The STATION table is described as follows:

![station](https://s3.amazonaws.com/hr-challenge-images/9336/1449345840-5f0a551030-Station.jpg)

where LAT_N is the northern latitude and LONG_W is the western longitude.

mysql
```mysql
select round(LONG_W,4)
from station
where LAT_N<137.2345
order by LAT_N desc limit 1
```

pandas
```python
s=pd.read_csv('station.csv')
s=s.sort_values('LAT_N',ascending=False)
# iloc[0] is the first row after sorting, i.e. the largest qualifying LAT_N
s.loc[(s['LAT_N']<137.2345),'LONG_W'].iloc[0].round(4)
```

## 34. Weather Observation Station 16

Query the smallest Northern Latitude (LAT_N) from STATION that is greater than 38.7880. Round your answer to 4 decimal places.

**Input Format**

The STATION table is described as follows:

![station](https://s3.amazonaws.com/hr-challenge-images/9336/1449345840-5f0a551030-Station.jpg)

where LAT_N is the northern latitude and LONG_W is the western longitude.

mysql
```mysql
select round(min(LAT_N),4)
from station
where LAT_N>38.7880
```

pandas
```python
s=pd.read_csv('station.csv')
s.loc[(s['LAT_N']>38.7880),'LAT_N'].min().round(4)
```

## 35. Weather Observation Station 17

Query the Western Longitude (LONG_W) where the smallest Northern Latitude (LAT_N) in STATION is greater than 38.7880. Round your answer to 4 decimal places.

**Input Format**

The STATION table is described as follows:

![station](https://s3.amazonaws.com/hr-challenge-images/9336/1449345840-5f0a551030-Station.jpg)

where LAT_N is the northern latitude and LONG_W is the western longitude.

mysql
```mysql
select round(LONG_W,4)
from station
where LAT_N>38.7880
order by LAT_N asc limit 1
```

pandas
```python
s=pd.read_csv('station.csv')
s=s.sort_values('LAT_N',ascending=True)
# iloc[0] is the first row after sorting, i.e. the smallest qualifying LAT_N
s.loc[(s['LAT_N']>38.7880),'LONG_W'].iloc[0].round(4)
```

## 36. Weather Observation Station 18

Consider $P_{1}(a,b)$ and $P_{2}(c,d)$ to be two points on a 2D plane.

* a happens to equal the minimum value in Northern Latitude (LAT_N in STATION).
* b happens to equal the minimum value in Western Longitude (LONG_W in STATION).
* c happens to equal the maximum value in Northern Latitude (LAT_N in STATION).
* d happens to equal the maximum value in Western Longitude (LONG_W in STATION).

Query the [Manhattan Distance](https://xlinux.nist.gov/dads/HTML/manhattanDistance.html) between points $P_{1}$ and $P_{2}$ and round it to a scale of 4 decimal places.

**Input Format**

The STATION table is described as follows:

![station](https://s3.amazonaws.com/hr-challenge-images/9336/1449345840-5f0a551030-Station.jpg)

where LAT_N is the northern latitude and LONG_W is the western longitude.

mysql
```mysql
select round(max(LAT_N)-min(LAT_N)+max(LONG_W)-min(LONG_W),4)
from station
```

pandas
```python
s=pd.read_csv('station.csv')
latd=s['LAT_N'].max()-s['LAT_N'].min()
longd=s['LONG_W'].max()-s['LONG_W'].min()
round(latd+longd,4)
```

## 37. Weather Observation Station 19

Consider $P_{1}(a,c)$ and $P_{2}(b,d)$ to be two points on a 2D plane where (a,b) are the respective minimum and maximum values of Northern Latitude (LAT_N) and (c,d) are the respective minimum and maximum values of Western Longitude (LONG_W) in STATION.

Query the [Euclidean Distance](https://en.wikipedia.org/wiki/Euclidean_distance) between points $P_{1}$ and $P_{2}$ and format your answer to display 4 decimal digits.


**Input Format**

The STATION table is described as follows:

![station](https://s3.amazonaws.com/hr-challenge-images/9336/1449345840-5f0a551030-Station.jpg)

where LAT_N is the northern latitude and LONG_W is the western longitude.

mysql
```mysql
select round(sqrt(pow(max(LAT_N)-min(LAT_N),2)+pow(max(LONG_W)-min(LONG_W),2)),4)
from station
```

pandas
```python
s=pd.read_csv('station.csv')
(((s['LAT_N'].max()-s['LAT_N'].min())**2+(s['LONG_W'].max()-s['LONG_W'].min())**2)**0.5).round(4)
```

## 38. Weather Observation Station 20

A [median](https://en.wikipedia.org/wiki/Median) is defined as a number separating the higher half of a data set from the lower half. Query the median of the Northern Latitudes (LAT_N) from STATION and round your answer to 4 decimal places.

**Input Format**

The STATION table is described as follows:

![station](https://s3.amazonaws.com/hr-challenge-images/9336/1449345840-5f0a551030-Station.jpg)

where LAT_N is the northern latitude and LONG_W is the western longitude.

mysql
```mysql
set @rn=0;
select round(avg(LAT_N),4)
from (
select (@rn:=@rn+1) as rn, LAT_N
from station order by LAT_N) as t1
where rn = (select ceil(count(*)/2) from station)
or rn = (select floor(count(*)/2)+1 from station);
```

pandas
```python
s=pd.read_csv('station.csv')
round(s['LAT_N'].median(),4)
```

## 39. Interviews

Samantha interviews many candidates from different colleges using coding challenges and contests. Write a query to print the contest_id, hacker_id, name, and the sums of total_submissions, total_accepted_submissions, total_views, and total_unique_views for each contest sorted by contest_id. Exclude the contest from the result if all four sums are 0.

**Note:** A specific contest can be used to screen candidates at more than one college, but each college only holds 1 screening contest.

---
**Input Format**

The following tables hold interview data:

* Contests: The contest_id is the id of the contest, hacker_id is the id of the hacker who created the contest, and name is the name of the hacker. 
![contests](https://s3.amazonaws.com/hr-challenge-images/19596/1458517426-e017c3460e-ScreenShot2016-03-21at4.57.47AM.png)

* Colleges: The college_id is the id of the college, and contest_id is the id of the contest that Samantha used to screen the candidates. 
![colleges](https://s3.amazonaws.com/hr-challenge-images/19596/1458517503-fd4aa63111-ScreenShot2016-03-21at4.57.56AM.png)

* Challenges: The challenge_id is the id of the challenge that belongs to one of the contests whose contest_id Samantha forgot, and college_id is the id of the college where the challenge was given to candidates. 
![challenges](https://s3.amazonaws.com/hr-challenge-images/19596/1458517661-a642f750ce-ScreenShot2016-03-21at4.58.04AM.png)

* View_Stats: The challenge_id is the id of the challenge, total_views is the number of times the challenge was viewed by candidates, and total_unique_views is the number of times the challenge was viewed by unique candidates. 
![view_stats](https://s3.amazonaws.com/hr-challenge-images/19596/1458517983-b4302286a8-ScreenShot2016-03-21at4.58.15AM.png)

* Submission_Stats: The challenge_id is the id of the challenge, total_submissions is the number of submissions for the challenge, and total_accepted_submissions is the number of submissions that achieved full scores.
![submission_stats](https://s3.amazonaws.com/hr-challenge-images/19596/1458518090-80983c916a-ScreenShot2016-03-21at4.58.27AM.png)
---
**Sample Input**

Contests Table:  
![conteststable](https://s3.amazonaws.com/hr-challenge-images/19596/1458519044-d788f8a6ee-ScreenShot2016-03-21at4.58.39AM.png)

Colleges Table:  
![collegestable](https://s3.amazonaws.com/hr-challenge-images/19596/1458519098-912836d6ac-ScreenShot2016-03-21at4.59.22AM.png)

Challenges Table:  
![challengestable](https://s3.amazonaws.com/hr-challenge-images/19596/1458519120-c531743caf-ScreenShot2016-03-21at4.59.32AM.png)

View_Stats Table: 
![view_statstable](https://s3.amazonaws.com/hr-challenge-images/19596/1458519152-107a67866b-ScreenShot2016-03-21at4.59.43AM.png)

Submission_Stats Table: 
![submission_statstable](https://s3.amazonaws.com/hr-challenge-images/19596/1458519173-091aba871a-ScreenShot2016-03-21at4.59.55AM.png)

**Sample Output**
```
66406 17973 Rose 111 39 156 56
66556 79153 Angela 0 0 11 10
94828 80275 Frank 150 38 41 15
```

**Explanation**

The contest 66406 is used in the college 11219. In this college 11219, challenges 18765 and 47127 are asked, so from the view and submission stats:

* Sum of total submissions = 27+56+28 = 111

* Sum of total accepted submissions = 10+18+11 = 39

* Sum of total views = 43+72+26+15 = 156

* Sum of total unique views = 10+13+19+14 = 56

Similarly, we can find the sums for contests 66556 and 94828.

mysql
```mysql
SELECT 
ct.contest_id,ct.hacker_id,ct.name,
ifnull(sum(s.total_submissions),0),
ifnull(sum(s.total_accepted_submissions),0),
ifnull(sum(v.total_views),0),
ifnull(sum(v.total_unique_views),0)
from contests as ct
join colleges as co on ct.contest_id=co.contest_id
join challenges as ch on ch.college_id=co.college_id 
left join (select challenge_id, 
                  sum(total_views) as total_views, 
                  sum(total_unique_views) as total_unique_views
      from view_stats group by challenge_id) as v
on v.challenge_id=ch.challenge_id
left join (select challenge_id, 
                  sum(total_submissions) as total_submissions, 
                  sum(total_accepted_submissions) as total_accepted_submissions
      from submission_stats group by challenge_id) as s
on s.challenge_id=ch.challenge_id
-- where ct.contest_id=845 and co.college_id=96 and ch.challenge_id=97
-- and s.challenge_id is null
group by ct.contest_id,ct.hacker_id,ct.name
having sum(v.total_views)>0 or sum(v.total_unique_views)>0 
or sum(s.total_submissions)>0 or sum(s.total_accepted_submissions)>0
order by ct.contest_id asc
```

pandas
```python
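import pandas as pd

ct=pd.read_csv('contests.csv')
co=pd.read_csv('colleges.csv')
ch=pd.read_csv('challenges.csv')
v=pd.read_csv('view_stats.csv')
s=pd.read_csv('submission_stats.csv')

# aggregate stats per challenge first so the two left merges below do not multiply rows
v=v.groupby('challenge_id',as_index=False)[['total_views','total_unique_views']].sum()
s=s.groupby('challenge_id',as_index=False)[['total_submissions','total_accepted_submissions']].sum()

df=pd.merge(left=ct,right=co,on='contest_id',how='inner')
df=pd.merge(left=df,right=ch,on='college_id',how='inner')
df=pd.merge(left=df,right=v,on='challenge_id',how='left') # left merge because not every challenge_id has views
df=pd.merge(left=df,right=s,on='challenge_id',how='left') # left merge because not every challenge_id has submissions

# sum per contest (missing stats count as 0), drop contests whose four sums are all 0, sort by contest_id
cols=['total_submissions','total_accepted_submissions','total_views','total_unique_views']
r=df.groupby(['contest_id','hacker_id','name'])[cols].sum().reset_index()
r=r[(r[cols]!=0).any(axis=1)].sort_values('contest_id')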
```

## 40. 15 Days of Learning SQL

Julia conducted a 15 days of learning SQL contest. The start date of the contest was March 01, 2016 and the end date was March 15, 2016.

Write a query to print total number of unique hackers who made at least 1 submission each day (starting on the first day of the contest), and find the hacker_id and name of the hacker who made maximum number of submissions each day. If more than one such hacker has a maximum number of submissions, print the lowest hacker_id. The query should print this information for each day of the contest, sorted by the date.

---

**Input Format**

The following tables contain contest data:

* Hackers: The hacker_id is the id of the hacker, and name is the name of the hacker.
![Hackers](https://s3.amazonaws.com/hr-challenge-images/19597/1458511164-12adec3b8b-ScreenShot2016-03-21at3.26.47AM.png)
* Submissions: The submission_date is the date of the submission, submission_id is the id of the submission, hacker_id is the id of the hacker who made the submission, and score is the score of the submission. 
![Submissions](https://s3.amazonaws.com/hr-challenge-images/19597/1458511251-0b534030b9-ScreenShot2016-03-21at3.26.56AM.png)

**Sample Input**

For the following sample input, assume that the end date of the contest was March 06, 2016.

Hackers Table:
![HackersTable](https://s3.amazonaws.com/hr-challenge-images/19597/1458511957-814a2c7bf2-ScreenShot2016-03-21at3.27.06AM.png)
Submissions Table:
![SubmissionsTable](https://s3.amazonaws.com/hr-challenge-images/19597/1458512015-ff6a708164-ScreenShot2016-03-21at3.27.21AM.png)

**Sample Output**
```
2016-03-01 4 20703 Angela
2016-03-02 2 79722 Michael
2016-03-03 2 20703 Angela
2016-03-04 2 20703 Angela
2016-03-05 1 36396 Frank
2016-03-06 1 20703 Angela
```

**Explanation**

On March 01, 2016 hackers 20703, 36396, 53473, and 79722 made submissions. There are 4 unique hackers who made at least one submission each day. As each hacker made one submission, 20703 (the lowest hacker_id) is considered to be the hacker who made the maximum number of submissions on this day. The name of the hacker is Angela.

On March 02, 2016 hackers 15758, 20703, and 79722 made submissions. Now 20703 and 79722 were the only ones to submit every day, so there are 2 unique hackers who made at least one submission each day. 79722 made 2 submissions, and the name of the hacker is Michael.

On March 03, 2016 hackers 20703, 36396, and 79722 made submissions. Now 20703 and 79722 were the only ones to submit every day, so there are 2 unique hackers who made at least one submission each day. As each hacker made one submission, 20703 is considered to be the hacker who made the maximum number of submissions on this day. The name of the hacker is Angela.

On March 04, 2016 hackers 20703, 44065, 53473, and 79722 made submissions. Now 20703 and 79722 were the only ones to submit every day, so there are 2 unique hackers who made at least one submission each day. As each hacker made one submission, 20703 is considered to be the hacker who made the maximum number of submissions on this day. The name of the hacker is Angela.

On March 05, 2016 hackers 20703, 36396, 38289, and 62529 made submissions. Now only 20703 had submitted every day, so there is only 1 unique hacker who made at least one submission each day. 36396 made the most submissions that day, and the name of the hacker is Frank.

On March 06, 2016 only 20703 made a submission, so there is only 1 unique hacker who made at least one submission each day. 20703 made 1 submission, and the name of the hacker is Angela.

mysql
```mysql
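-- one row per contest day: streak count, top submitter (lowest id on ties), name
select t3.submission_date,hc.hcount,h.hacker_id,h.name
from hackers as h
join (
        -- hacker with the most submissions per day; min() breaks ties
        select t1.submission_date,t1.scount,min(t1.hacker_id) as hacker_id
        from (select submission_date, hacker_id, count(submission_id) as scount
              from submissions group by submission_date, hacker_id) as t1
        join (select submission_date,max(scount) as maxscount
              from (select submission_date, hacker_id, count(submission_id) as scount
                    from submissions
                    group by submission_date, hacker_id) as subcounts
              group by submission_date) as t2
        on t1.scount=t2.maxscount and t1.submission_date=t2.submission_date
        group by t1.submission_date,t1.scount
) as t3 on h.hacker_id=t3.hacker_id
join (
        -- hackers who submitted on every day from 2016-03-01 through this day:
        -- their number of distinct earlier submission days equals the days elapsed
        select s.submission_date,count(distinct s.hacker_id) as hcount
        from submissions as s
        where (select count(distinct s2.submission_date)
               from submissions as s2
               where s2.hacker_id=s.hacker_id
                 and s2.submission_date<s.submission_date)
              =datediff(s.submission_date,'2016-03-01')
        group by s.submission_date
) as hc on t3.submission_date=hc.submission_date
order by t3.submission_date asc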
```
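
A pandas version could look like the sketch below. It is an illustration under assumptions, not a verified submission: the function name `daily_leaders`, the tiny `hackers` and `submissions` frames, and the hacker names are all made up for the example (the real sample tables are only available as images). The idea is that a hacker has submitted every day so far exactly when the number of distinct days they have been active equals the number of contest days elapsed.

```python
import pandas as pd

# Hypothetical sample data, not the HackerRank sample tables.
hackers = pd.DataFrame({
    "hacker_id": [1, 2, 3],
    "name": ["Ann", "Bob", "Cy"],
})
submissions = pd.DataFrame({
    "submission_date": pd.to_datetime([
        "2016-03-01", "2016-03-01", "2016-03-01",
        "2016-03-02", "2016-03-02", "2016-03-02",
        "2016-03-03", "2016-03-03",
    ]),
    "submission_id": range(1, 9),
    "hacker_id": [1, 2, 3, 2, 2, 3, 1, 3],
})


def daily_leaders(hackers, submissions, start="2016-03-01"):
    """Per day: hackers with an unbroken streak, plus the top submitter."""
    start = pd.Timestamp(start)

    # One row per (day, hacker) with that hacker's submission count.
    per_day = (submissions
               .groupby(["submission_date", "hacker_id"])
               .size()
               .reset_index(name="scount")
               .sort_values(["hacker_id", "submission_date"]))

    # days_active: how many distinct days this hacker has submitted so far.
    # days_elapsed: how many contest days have passed, including this one.
    per_day["days_active"] = per_day.groupby("hacker_id").cumcount() + 1
    per_day["days_elapsed"] = (per_day["submission_date"] - start).dt.days + 1

    # Equal values mean the hacker has not missed a single day.
    streak = per_day[per_day["days_active"] == per_day["days_elapsed"]]
    hcount = (streak.groupby("submission_date")["hacker_id"]
              .nunique()
              .reset_index(name="hcount"))

    # Top submitter per day: highest count first, lowest hacker_id on ties.
    top = (per_day
           .sort_values(["submission_date", "scount", "hacker_id"],
                        ascending=[True, False, True])
           .drop_duplicates("submission_date"))

    out = (top.merge(hcount, on="submission_date", how="left")
              .merge(hackers, on="hacker_id")
              .sort_values("submission_date"))
    out["hcount"] = out["hcount"].fillna(0).astype(int)
    return out[["submission_date", "hcount", "hacker_id", "name"]].reset_index(drop=True)


result = daily_leaders(hackers, submissions)
print(result)
```

In the made-up data, hacker 1 skips March 02, so the streak count drops from 3 to 2 to 1 over the three days, while hacker 2's two submissions on March 02 make them that day's top submitter.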

## Experimental Cells

In [40]:
df.loc[(df[0].str[-1].isin(['a','e','i','o','u']))&(df[0].str[0].isin(['a','e','i','o','u'])),0].unique()

array([], dtype=object)

In [47]:
df.loc[df[0].str[-3:].sort_values().index]

Unnamed: 0,0,namelength
1,asdkjhg,7
0,asdlkjg,7
2,asdlkfjwlektjlhk,16
3,California,10


In [50]:
df['substring']=df[0].str[-3:]

In [61]:
df.sort_values(['substring','indexname'])

indexname,0,namelength,substring
1,asdkjhg,7,jhg
0,asdlkjg,7,kjg
2,asdlkfjwlektjlhk,16,lhk
3,California,10,nia


In [241]:
s

Unnamed: 0,name,marks
0,Maria,99
1,Jane,81
2,Julia,88
3,Scarlet,78
4,Ashley,63
5,Samantha,68


In [1]:
# create test data
import pandas as pd
import numpy as np
import seaborn as sns

s=pd.DataFrame()
s['name']=['Maria','Jane','Julia','Scarlet','Ashley','Samantha']
s['marks']=[99,81,88,78,63,68]

g=pd.DataFrame()
g['grade']=np.arange(1,11)
g['min_mark']=np.arange(0,100,10)
g['max_mark']=np.arange(9,109,10)
# replace final score with 100
g.loc[g.index[-1],'max_mark']=100

In [2]:
# cheating math method (clip so a mark of 100 stays in grade 10)
s['grade']=((s['marks']//10)+1).clip(upper=10)

In [3]:
# using dot product and lambda function
s['grade']=s['marks'].apply(lambda x: ((x>=g['min_mark'])&(x<=g['max_mark'])).dot(g['grade']))

In [10]:
s[['marks','grade']].mean().round(2)

marks    79.50
grade     8.33
dtype: float64

In [4]:
# formatting
s['outputname']=s['name']
s.loc[s['grade']<8,'outputname']=None
s.sort_values(['grade','outputname','marks'],ascending=[False,True,False],inplace=True)
s[['outputname','grade','marks']]

Unnamed: 0,outputname,grade,marks
0,Maria,10,99
1,Jane,9,81
2,Julia,9,88
3,Scarlet,8,78
5,,7,68
4,,7,63


In [5]:
g.index=pd.IntervalIndex.from_arrays(g['min_mark'],g['max_mark'],closed='both')

In [6]:
# s['marks2']=s['marks'].apply(lambda x: g.iloc[g.index.get_loc(x)]['grade'])
# s['marks3']=g.iloc[g.index.get_indexer(s['marks'])]['grade'].reset_index(drop=True)
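
The two commented-out lookups above try to map each mark to a grade through the `IntervalIndex`. A self-contained sketch of that idea follows, alongside a `pd.cut` variant. The names `marks`, `grades`, `by_cut`, and `by_interval` are chosen for this example only, and the `pd.cut` bins assume the grade ranges are contiguous, as they are in the test data here.

```python
import numpy as np
import pandas as pd

# Same shape as the test data used in the cells above.
marks = pd.Series([99, 81, 88, 78, 63, 68], name="marks")
grades = pd.DataFrame({
    "grade": np.arange(1, 11),
    "min_mark": np.arange(0, 100, 10),
    "max_mark": np.arange(9, 109, 10),
})
grades.loc[grades.index[-1], "max_mark"] = 100  # top grade covers 90-100

# Option 1: pd.cut with right-closed bins (-1, 9], (9, 19], ..., (89, 100].
bins = [-1] + grades["max_mark"].tolist()
by_cut = pd.cut(marks, bins=bins, labels=grades["grade"].tolist()).astype(int)

# Option 2: look each mark up in an IntervalIndex built from the grade table.
intervals = pd.IntervalIndex.from_arrays(
    grades["min_mark"], grades["max_mark"], closed="both"
)
positions = intervals.get_indexer(marks)  # -1 would mean "no interval matched"
assert (positions >= 0).all()
by_interval = grades["grade"].to_numpy()[positions]

print(by_cut.tolist())
print(by_interval.tolist())
```

Both options avoid the row-by-row `apply`, which matters little at this size but keeps the lookup vectorised on larger tables.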

In [7]:
# # create test data
# h=pd.DataFrame(s['name']).reset_index()
# h.columns=['hacker_id','name']
# sub=g.copy().reset_index()
# sub.columns=['hacker_id','grade','min_mark','max_mark']

# # test join function
# h.join(sub,on='hacker_id',how='inner')

In [8]:
df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                   'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})

other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                      'B': ['B0', 'B1', 'B2']})

other2 = pd.DataFrame({'key': ['K1', 'K3', 'K4', 'K5', 'K5'],
                       'C': ['C1', 'C3', 'C4', 'C5', 'C5_2']})

In [9]:
mdf=pd.merge(left=df,right=other,on='key',how='left')
mdf['nums']=np.random.randint(1,100,6)
mdf.loc[6]=['K2','A2','B2',890]

In [10]:
mdf.groupby(['key','A','B'])['nums'].max()

key  A   B 
K0   A0  B0     82
K1   A1  B1     83
K2   A2  B2    890
Name: nums, dtype: int64

In [11]:
groups=mdf.groupby(['key','A'])
result=groups['B'].count().reset_index()
max_count=result['B'].max()
singles=result['B'].value_counts().loc[lambda c: c==1].index

In [12]:
ccounts=result['B'].value_counts()
ccounts[ccounts==1].index

Int64Index([2], dtype='int64')

In [297]:
result.groupby('B')['B'].transform('size')==1

0    False
1    False
2     True
3    False
4    False
5    False
Name: B, dtype: bool

In [305]:
result.drop_duplicates(subset='B',keep=False)

Unnamed: 0,key,A,B
2,K2,A2,2


In [301]:
result

Unnamed: 0,key,A,B
0,K0,A0,1
1,K1,A1,1
2,K2,A2,2
3,K3,A3,0
4,K4,A4,0
5,K5,A5,0


In [235]:
mdf

Unnamed: 0,key,A,B,nums
0,K0,A0,B0,5
1,K1,A1,B1,33
2,K2,A2,B2,92
3,K3,A3,,4
4,K4,A4,,74
5,K5,A5,,13


In [220]:
mdf['nums']=np.random.randint(1,100,len(mdf))

Unnamed: 0,key,A,B
0,K0,A0,B0
1,K1,A1,B1
2,K2,A2,B2
3,K3,A3,
4,K4,A4,
5,K5,A5,


In [211]:
test=df.join(other.set_index('key'),on='key',how='left').join(other2.set_index('key'),on='key',how='left')
test.loc[4,'B']='B4'

In [208]:
other

Unnamed: 0,key,B
0,K0,B0
1,K1,B1
2,K2,B2


In [214]:
test.join(other.set_index(['key','B']),on=['key','B'],how='inner')

Unnamed: 0,key,A,B,C
0,K0,A0,B0,
1,K1,A1,B1,C1
2,K2,A2,B2,


In [217]:
test

Unnamed: 0,key,A,B,C
0,K0,A0,B0,
1,K1,A1,B1,C1
2,K2,A2,B2,
3,K3,A3,,C3
4,K4,A4,B4,C4
5,K5,A5,,C5
5,K5,A5,,C5_2


In [204]:
test.join

Unnamed: 0,key,A,B,C
1,K1,A1,B1,C1


In [190]:
groups=test.groupby(['key','A'])
result=groups['C'].count().reset_index()

In [192]:
result.loc[result['C']>0]

Unnamed: 0,key,A,C
3,K3,A3,1
4,K4,A4,1
5,K5,A5,2


In [175]:
groupres.index.get_level_values('key')

Index(['K0', 'K1', 'K2', 'K3', 'K4', 'K5'], dtype='object', name='key')

In [179]:
groupres#.reset_index('key').sort_values(['C','key'],ascending=[False,True])

key  A 
K0   A0    0
K1   A1    0
K2   A2    0
K3   A3    1
K4   A4    1
K5   A5    2
Name: C, dtype: int64

In [184]:
groupres.reset_index()

Unnamed: 0,key,A,C
0,K0,A0,0
1,K1,A1,0
2,K2,A2,0
3,K3,A3,1
4,K4,A4,1
5,K5,A5,2


In [29]:
# testing contest leaderboard
iris=sns.load_dataset('iris')
groups=iris.groupby(['species','petal_width','petal_length'])
maxscores=groups['sepal_length'].max().reset_index()
totalscores=maxscores.groupby(['species','petal_width'])['sepal_length'].sum().reset_index()
totalscores

Unnamed: 0,species,petal_width,sepal_length
0,setosa,0.1,14.4
1,setosa,0.2,42.1
2,setosa,0.3,20.9
3,setosa,0.4,26.6
4,setosa,0.5,5.1
5,setosa,0.6,5.0
6,versicolor,1.0,28.0
7,versicolor,1.1,16.2
8,versicolor,1.2,28.9
9,versicolor,1.3,48.1


In [38]:
# testing Projects 
p = pd.DataFrame({'task_id': [1,2,3,4,5,6,7],
                'start_date': pd.to_datetime(['2015-10-01', '2015-10-02', '2015-10-03', '2015-10-13', '2015-10-14',
                               '2015-10-28', '2015-10-30']),
                  'end_date': pd.to_datetime(['2015-10-02', '2015-10-03', '2015-10-04', '2015-10-14', '2015-10-15',
                               '2015-10-29', '2015-10-31'])})

In [94]:
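# end_date of the previous task
p['shifted']=p['end_date'].shift(1)
# True where a task starts a new project (first row, or a gap of more than one day)
p['nc']=((p['end_date']-p['shifted']).dt.days>1) | p['shifted'].isnull()
# running total of project starts gives each task its project number (this is key)
p['pnum']=p['nc'].cumsum()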

In [111]:
groups=p.groupby('pnum')
p2=groups['start_date'].first().to_frame()
p2['end_date']=groups['end_date'].last()
p2['pdurr']=(p2['end_date']-p2['start_date']).dt.days
p2.sort_values(['pdurr','start_date'],ascending=[True,True],inplace=True)
p2[['start_date','end_date']]

pnum,start_date,end_date
3,2015-10-28,2015-10-29
4,2015-10-30,2015-10-31
2,2015-10-13,2015-10-15
1,2015-10-01,2015-10-04


In [106]:
p2

pnum,start_date,end_date,pdurr
1,2015-10-01,2015-10-04,3
2,2015-10-13,2015-10-15,2
3,2015-10-28,2015-10-29,1
4,2015-10-30,2015-10-31,1


In [93]:
groups=p.groupby('pnum')
groups['end_date'].last()-groups['start_date'].first()

pnum
1   3 days
2   2 days
3   1 days
4   1 days
dtype: timedelta64[ns]

In [85]:
p.groupby(p['nc'].cumsum())['c'].sum()+1

nc
1    4.0
2    2.0
3    1.0
4    1.0
Name: c, dtype: float64

In [86]:
p['nc'].cumsum()

0    1
1    1
2    1
3    2
4    2
5    3
6    4
Name: nc, dtype: int64

In [46]:
p['end_date']-p['shifted']

0       NaT
1    1 days
2    1 days
3   10 days
4    1 days
5   14 days
6    2 days
dtype: timedelta64[ns]

In [61]:
p.loc[p['nc'],'end_date']

3   2015-10-14
5   2015-10-29
6   2015-10-31
Name: end_date, dtype: datetime64[ns]

In [97]:
p['c']=~p['nc']

In [76]:
p.loc[p['nc'],'shifted'].shift(-1)-p.loc[p['nc'],'start_date']

0   3 days
3   2 days
5   1 days
6      NaT
dtype: timedelta64[ns]

In [80]:
(p['end_date']-p['end_date'].shift()).dt.days>1

0    False
1    False
2    False
3     True
4    False
5     True
6     True
Name: end_date, dtype: bool