# Introduction

Our goal is to build a search engine over the "Top Anime Series" list on MyAnimeList (https://myanimelist.net). No dataset is provided, so we create our own.

In [None]:
from bs4 import BeautifulSoup
from tqdm import tqdm
import pandas as pd
import numpy as np
import requests
import codecs
import time
import csv
import os

## 1. Data Collection

**DISCLAIMER**: Some parts of the following code were inspired by work done last year on https://www.goodreads.com, for example https://github.com/GiorgiaSalvatori/ADM-HW3.

We start from the list of animes to include in the corpus of documents the search engine will work on. In particular, we focus on the list of the top animes ever: https://myanimelist.net/topanime.php. The list is long and split across many pages. The first thing we do is retrieve the urls (and the names) of the animes listed in the first 400 pages (each page has 50 animes, so we end up with up to 20000 unique anime urls).
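The pagination described above can be sketched as follows. This is a minimal illustration: the `limit` query parameter is the one used in the scraping cell of this notebook, while the helper name `ranking_page_url` is hypothetical.

```python
# Base URL of the ranking, as given in the text above.
BASE_URL = "https://myanimelist.net/topanime.php"
ANIMES_PER_PAGE = 50  # each ranking page lists 50 animes


def ranking_page_url(page):
    """Return the URL of the ranking page with the given 0-based index (hypothetical helper)."""
    return f"{BASE_URL}?limit={page * ANIMES_PER_PAGE}"


# URLs of the first 400 ranking pages
page_urls = [ranking_page_url(page) for page in range(400)]
print(page_urls[0])   # https://myanimelist.net/topanime.php?limit=0
print(page_urls[1])   # https://myanimelist.net/topanime.php?limit=50
print(len(page_urls) * ANIMES_PER_PAGE)  # 20000
```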

### 1.1 Get the list of animes

Here we will extract the *urls* and the *names* of the animes in the list. First we get an idea of the steps needed to extract the information we want by working on a single anime in the list, and then we proceed by iteration.

After inspecting the HTML code of the site, we saw that all the information we need about a single anime is stored in `tr` blocks inside a single `table` that contains the list of all the top animes on the site. Both the name and the url of an anime come from the `a` tags inside each row: the name is the text of the tag, and the url is its `href` attribute.

Knowing these HTML details we can use the `BeautifulSoup` library to do the web scraping.
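To illustrate the extraction logic without network access, here is a small sketch that runs on a made-up HTML snippet. It uses the standard library `html.parser` in place of `BeautifulSoup` so that it has no extra dependencies; the sample markup, the `example.org` urls and the class name `AnimeLinkParser` are assumptions for illustration only.

```python
from html.parser import HTMLParser

# Made-up markup that mimics the structure described above: rows with links.
SAMPLE_HTML = """
<table>
  <tr><td><a id="area1" href="https://example.org/anime/1/First">First</a></td></tr>
  <tr><td><a href="https://example.org/other">ignored: no id</a></td></tr>
  <tr><td><a id="area2" href="https://example.org/anime/2/Second">Second</a></td></tr>
</table>
"""


class AnimeLinkParser(HTMLParser):
    """Collect (name, url) pairs from the <a> tags that carry an id attribute."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None  # href of the <a> tag being read, if any

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and "id" in attributes:
            self._current_href = attributes.get("href")

    def handle_data(self, data):
        # the first non-empty text inside a selected <a> tag is the anime name
        if self._current_href is not None and data.strip():
            self.links.append((data.strip(), self._current_href))
            self._current_href = None

    def handle_endtag(self, tag):
        if tag == "a":
            self._current_href = None


parser = AnimeLinkParser()
parser.feed(SAMPLE_HTML)
for name, url in parser.links:
    print(name, url)
```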

In [None]:
# EXECUTE ONLY ONCE
# IF THE FILE links.txt EXISTS THEN DO NOT EXECUTE THIS CELL

# REMARK: the execution can take some time (some minutes)

# open a .txt file to store the urls we need (it is closed automatically at the end)
with open("links.txt", "w") as links_text:
    # go page by page through the site and scrape the urls we need
    for page in tqdm(range(0, 400)):
        url = 'https://myanimelist.net/topanime.php?limit=' + str(page * 50)
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        for tag in soup.find_all('tr'):
            links = tag.find_all('a')
            for link in links:
                # keep only the anchors with an id whose first child has length > 1
                if isinstance(link.get('id'), str) and len(link.contents[0]) > 1:
                    data = link.get('href')
                    # write the scraped url in the .txt file with '\n' at the end of each row
                    links_text.write(data)
                    links_text.write("\n")

In [None]:
# EXECUTE IF AND ONLY IF THE links.txt FILE HAS BEEN CREATED

# Read the number of lines in the .txt file
file = open("links.txt", "r")
line_count = 0
for line in file:
    if line != "\n":
        line_count += 1
file.close()

print('There are total {} lines in this file.'.format(line_count))

### 1.2 Crawl animes

We proceed to:
- download the HTML page corresponding to each of the collected urls;
- save its HTML in a file;
- organize the entire set of downloaded HTML pages into folders. Each folder will contain the HTML files of the animes in page 1, page 2, ... of the list of animes.

To do so we make extensive use of the `os` library to create directories, change paths, and so on.
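The folder organization can be sketched as follows. This is only an illustration under assumed naming conventions (`page_<n>` folders and `article_<n>.html` files, both numbered from 1); the helper `html_path` is hypothetical, and the sketch writes into a temporary directory instead of downloading anything.

```python
import os
import tempfile

ANIMES_PER_PAGE = 50  # each ranking page lists 50 animes


def html_path(base_dir, anime_index):
    """Return the file path for the anime with the given 0-based index,
    creating the folder of its ranking page if needed (hypothetical layout)."""
    page_number = anime_index // ANIMES_PER_PAGE + 1
    folder = os.path.join(base_dir, f"page_{page_number}")
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    return os.path.join(folder, f"article_{anime_index + 1}.html")


with tempfile.TemporaryDirectory() as base_dir:
    # animes 0 and 49 belong to the first page, anime 50 to the second one
    for anime_index in (0, 49, 50):
        path = html_path(base_dir, anime_index)
        with open(path, "w", encoding="utf-8") as html_file:
            html_file.write("<html></html>")
    created = sorted(os.listdir(base_dir))
    print(created)  # ['page_1', 'page_2']
```

Building absolute paths with `os.path.join` avoids changing the working directory, and `exist_ok=True` makes the step safe to re-run.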

In [None]:
# EXECUTE ONLY ONCE
# IF THE DIRECTORY TREE ALREADY EXISTS THEN DO NOT EXECUTE THIS CELL

# REMARK: the execution can take quite some time (>25 hours)
# REMARK: there is an issue with high frequency site-connections that blocks most of the page requests 
# a time delay between page requests has been included to solve that issue


file = open("links.txt", "r")
lines = file.read().split('\n')
file.close()
# returns current working directory
base = os.getcwd()  
# initialize the number of the first directory to be created
t = 1
# we use the previously created list of lines to get the urls we need
scrapped_urls = lines[0:-1]
for i in range(len(scrapped_urls)):
    if(i%50==0):
        # create a new folder
        # remark: the newly created pages will start from 0
        page_identifier = i // 50
        # subdirectory 
        directory = f"page_{page_identifier}.html"
        # parent directories
        parent_dir = base
        # path
        path = os.path.join(parent_dir, directory)
        # make directory
        os.makedirs(path)
        # check
        # print("Directory '%s' created" %directory)
        # change directory 
        os.chdir(path)
        t += 1

    # to avoid the issue with high frequency site-connections  
    time.sleep(5)   

    # get urls
    URL = scrapped_urls[i]
    page = requests.get(URL)
    
    # parsing
    soup_data = BeautifulSoup(page.content, "html.parser")
    
    # saving
    with open(f"article_{i}.html", "w") as file:
        file.write(str(soup_data))
        
    # check
    # print(f"Article {i} has been created")
    

### 1.3 Parse downloaded pages

At this point we have all the HTML documents about the animes of interest and we can start to extract the following information for each anime:
- Anime Name (to save as `animeTitle`): String
- Anime Type (to save as `animeType`): String
- Number of episodes (to save as `animeNumEpisode`): Integer
- Release and End Dates of anime (to save as `releaseDate` and `endDate`): Convert both release and end date into datetime format.
- Number of members (to save as `animeNumMembers`): Integer
- Score (to save as `animeScore`): Float
- Users (to save as `animeUsers`): Integer
- Rank (to save as `animeRank`): Integer
- Popularity (to save as `animePopularity`): Integer
- Synopsis (to save as `animeDescription`): String
- Related Anime (to save as `animeRelated`): Extract all the related animes, but only keep unique values and those that have a hyperlink associated to them. List of strings.
- Characters (to save as `animeCharacters`): List of strings.
- Voices (to save as `animeVoices`): List of strings
- Staff (to save as `animeStaff`): Include the staff name and their responsibility/task in a list of lists.
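The type conversions listed above can be sketched on plain strings. The input formats below (`"Apr 3, 1998 to Apr 24, 1999"` for the airing dates and comma-separated thousands for counts) are assumptions about how the page displays these values, and the helper names are hypothetical.

```python
from datetime import datetime

DATE_FORMAT = "%b %d, %Y"  # assumed display format, e.g. "Apr 3, 1998"


def parse_aired(aired):
    """Split an 'Aired' string into (release, end) datetimes; end is None if missing."""
    start_text, _, end_text = aired.strip().partition(" to ")
    release = datetime.strptime(start_text, DATE_FORMAT)
    try:
        end = datetime.strptime(end_text, DATE_FORMAT)
    except ValueError:
        # single date, or an end date that is not a date (e.g. still airing)
        end = None
    return release, end


def parse_count(text):
    """Convert a string such as '1,234,567' into an integer."""
    return int(text.replace(",", "").strip())


release, end = parse_aired("Apr 3, 1998 to Apr 24, 1999")
print(release.strftime("%m/%d/%Y"), end.strftime("%m/%d/%Y"))  # 04/03/1998 04/24/1999
print(parse_aired("Jul 20, 2001")[1])                          # None
print(parse_count(" 1,234,567 "))                              # 1234567
```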

In [None]:
animeTitle = []
animeType = []
animeNumEpisode = []
releaseDate = []
endDate = []
animeNumMembers = []
animeScore = []
animeUsers = []
animeRank = []
animePopularity = []
animeDescription = []
animeRelated = []
animeCharacters = []
animeVoices = []
animeStaff = []
directory = 'html_pages'

In [None]:
def parse_function(html_file_path):
    """
    Function that extracts an anime's information.
    Input: path (a string that is related to the position of each anime page in the folder tree)
    Output: None (the extracted values are appended to the global lists defined above)
    """
    
    # take article_i.html from the directory 
    with open(html_file_path) as html_file:
        soup = BeautifulSoup(html_file, "html.parser")
    divs = soup.find_all("div", {"class": "spaceit_pad"})
    try:
        animeTitle.append(str(soup.find_all('strong')[0].contents[0]))
    except:
        animeTitle.append('')

    for div in divs:
        spans = div.find_all("span")
        for span in spans:
            
            # TYPES
            if span.contents[0] == 'Type:':
                try:
                    animeType.append(str(div.find_all('a')[0].contents[0]))
                except:
                    animeType.append('NA')
            
            # NUMBER OF EPISODES
            if span.contents[0] == 'Episodes:':
                try: 
                    animeNumEpisode.append(int(div.contents[2]))
                except:
                    animeNumEpisode.append(0)
            
            # DATES
            if span.contents[0] == 'Aired:':
                try:
                    # assumed format: 'Apr 3, 1998 to Apr 24, 1999', or a single date
                    aired = str(div.contents[2]).strip()
                    start, _, finish = aired.partition(' to ')
                    release = pd.to_datetime(start).strftime('%m/%d/%Y')
                    end = '-'
                    if finish:
                        try:
                            end = pd.to_datetime(finish).strftime('%m/%d/%Y')
                        except Exception:
                            # the end date is not a date (e.g. still airing)
                            end = '-'
                    releaseDate.append(release)
                    endDate.append(end)
                except Exception:
                    releaseDate.append('')
                    endDate.append('')

    divs = soup.find_all("div", {"class": "stats-block po-r clearfix"})
    for div in divs:
        
        # MEMBERS
        members = div.find_all("span", {"class": "numbers members"})
        animeNumMembers.append(int(members[0].contents[1].contents[0].replace(',', '')))
        
        
        # SCORE
        rating=soup.find(name="div",attrs={"class":"fl-l score"})
        try:        
            animeScore.append(float(rating.text.strip()))
        except:
            animeScore.append(None)

     
        # USERS
        users = div.find_all("div", {"class": "fl-l score"})
        # here we eliminate the word 'user '   
        # that is why there is the [:-6] part
        # we also replace the comma divisor
        try:
            animeUsers.append(int(users[0]['data-user'][:-6].replace(',', '')))
        except:
            animeUsers.append(0)


        # RANK
        rank = div.find_all("span", {"class": "numbers ranked"})
        try:
            animeRank.append(int(rank[0].contents[1].contents[0][1:]))
        except:
            animeRank.append(None)

        # POPULARITY
        popularity = div.find_all("span", {"class": "numbers popularity"})
        animePopularity.append(int(popularity[0].contents[1].contents[0][1:]))
    
    # DESCRIPTION
    animeDescription.append(soup.find_all("p", itemprop = "description")[0].text.strip().replace('\n', '').replace('  ', ''))


    # RELATED 
    related = soup.find_all("table", {"class": "anime_detail_related_anime"})
    if(len(related)!=0):
        x = []
        y = []
        for tr in related:
            td = tr.find_all("td")
            for i in range(0, len(td), 2):
                x.append(td[i].contents[0])
                t = td[i+1].find_all("a")
                if(len(t[0].contents)!=0):  
                    y.append(t[0].contents[0])
                else:
                    y.append(' ')
            animeRelated.append('\n'.join([f'{x} {y}' for x, y in dict(zip(x, y)).items()]).split('\n'))
    else:
        animeRelated.append(' ')
    
    # CHARACTERS
    try:
        characters = soup.find_all("div", {"class": "detail-characters-list clearfix"})
        chars = characters[0].find_all("h3", {"class": "h3_characters_voice_actors"})
        x = []
        for i in chars:
            x.append(i.contents[0].contents[0])
        animeCharacters.append(x)
    except:
        animeCharacters.append(" ")
    
    # VOICES
    try:
        voices = characters[0].find_all("td", {"class": "va-t ar pl4 pr4"})
        y = []
        for i in voices:
            y.append(i.contents[1].contents[0])
        animeVoices.append(y)
    except:
        animeVoices.append(" ")
    
    # STAFF
    try:
        staff = soup.find_all("div", {"class": "detail-characters-list clearfix"})
        staff = staff[1].find_all("td")
        x = []
        y = []
        for i in range(1, len(staff), 2):
            x.append(staff[i].contents[1].contents[0])
            y.append(staff[i].find_all("small")[0].contents[0])
        animeStaff.append([list(i) for i in list(zip(x,y))])
    
    except:
        animeStaff.append(" ")
               

In [None]:
# EXECUTE ONLY ONCE
# IF THE DIRECTORY ALREADY EXISTS THEN DO NOT EXECUTE THIS CELL

# REMARK: the execution should take a few seconds

# create a directory tree for .tsv files
os.mkdir('tsv_files')

In [None]:
# EXECUTE ONLY ONCE
# IF THE .tsv FILES ALREADY EXIST THEN DO NOT EXECUTE THIS CELL

def tsv_create(i):
    """
    Function that creates a .tsv file from the parsed information of an anime
    Input: i, a positive integer
    Output: None
    Remark: it creates a .tsv file named anime_i in the tsv_files directory
    """
    tsv_columns = ['animeTitle','animeType','animeNumEpisode','releaseDate','endDate','animeNumMembers','animeScore',
                  'animeUsers','animeRank','animePopularity','animeDescription','animeRelated','animeCharacters',
                  'animeVoices','animeStaff']
    data = zip([animeTitle[i-1]],[animeType[i-1]],[animeNumEpisode[i-1]],[releaseDate[i-1]],[endDate[i-1]],[animeNumMembers[i-1]],[animeScore[i-1]],[animeUsers[i-1]],[animeRank[i-1]],[animePopularity[i-1]],[animeDescription[i-1]],[animeRelated[i-1]],[animeCharacters[i-1]],[animeVoices[i-1]],[animeStaff[i-1]])
    tsv_file_name = 'tsv_files/anime_'+str(i)+'.tsv'
    # write with the same encoding that is used later to read the .tsv files
    with open(tsv_file_name, 'w', newline='', encoding="utf8") as f_output:
        tsv_output = csv.writer(f_output, delimiter='\t')
        tsv_output.writerow(tsv_columns)
        for title,typ,numEp,relD,endD,numMem,score,user,rank,popularity,descr,relat,charac,voices,staff in data:
                tsv_output.writerow([title,typ,numEp,relD,endD,numMem,score,user,rank,popularity,descr,relat,charac,voices,staff])

In [None]:
# EXECUTE ONLY ONCE (WITH A SUBSET OF PAGES) 
# IF THE DIRECTORY AND THE .tsv FILES ALREADY EXIST THEN DO NOT EXECUTE THIS CELL

# REMARK: the execution can take quite a while (>1 hour)

directory = 'html_pages'
file_read = open('links.txt', 'r')
anime_urls_list = file_read.readlines()
file_read.close()

for i in range(1,384):
    html_page_name = 'page'+str(i+1)
    directory_subfolder = directory+'/'+html_page_name+'/'
    # the 383rd page has only 24 animes, all the other pages have 50
    animes_in_page = 24 if i == 383 else 50
    for j in range(1, animes_in_page+1):
        anime_num = 50*(i-1)+j
        html_file_path = directory_subfolder+'article_'+str(anime_num)+'.html'
        # parse_function opens and parses the html file itself
        parse_function(html_file_path)
        tsv_create(anime_num)

# 2. Search Engine

We will create two different Search Engines that, given as input a query, return the animes that match the query. First, we need to pre-process all the information collected for each anime by:
- Removing stopwords
- Removing punctuation
- Stemming

For this purpose, we will use the `nltk` library.
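Before turning to `nltk`, the first two steps can be illustrated with the standard library alone. This is only a sketch: the stopword set below is a tiny made-up subset, stemming is left out (it is handled with `nltk` in the cells that follow), and the function name `preprocess` is hypothetical.

```python
import re

# Tiny illustrative subset; the notebook uses the full nltk English stopword list.
STOPWORDS = {"a", "all", "and", "the", "of", "to", "in", "is"}


def preprocess(text):
    """Lowercase the text, replace punctuation with spaces and drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # anything that is not a word character or whitespace
    return [token for token in text.split() if token not in STOPWORDS]


print(preprocess("Playing Tennis, and Golf: all the day!"))
# ['playing', 'tennis', 'golf', 'day']
```

Lowercasing before the stopword check matters because the stopword lists are lowercase, so a capitalized "The" would otherwise survive.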

### Preprocessing

For the first version of the search engine, we narrow our focus to the `Synopsis` of each anime. This means that we evaluate queries only with respect to the anime's description (and its `Title`, as we believe it is also an important part of an anime description).

In [None]:
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer
import json

In [None]:
# import stopwords and store them in a variable
stop = stopwords.words('english')
# stemmer
porter_stemmer = PorterStemmer()

In [None]:
def stem_sentences(sentence):
    """
    Input: sentence, a string
    Output: the same sentence with every word replaced by its stem
    """
    tokens = sentence.split()
    stemmed_tokens = [porter_stemmer.stem(token) for token in tokens]
    return ' '.join(stemmed_tokens)

In [None]:
# example
stem_sentences('playing Tennis and Golf all the day')

In [None]:
animeTitle_list = []
animeDescription_list = []

# create lists from the .tsv files
for i in range(0,19124):
    with open('tsv_files/anime_'+str(i+1)+'.tsv', 'r',encoding="utf8") as anime_tsv:
        data=pd.read_table(anime_tsv)[['animeTitle','animeDescription']]
    data['animeTitle'] = data['animeTitle'].astype(str)
    data['animeDescription'] = data['animeDescription'].astype(str)
    animeTitle_list.append(str(data.animeTitle[0]))
    animeDescription_list.append(str(data.animeDescription[0]))

In [None]:
# concatenate lists to create a dataframe
anime_df = pd.DataFrame(np.column_stack([animeTitle_list, animeDescription_list]), 
                               columns=['animeTitle', 'animeDescription'])

In [None]:
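# removing stopwords from the dataframe (case-insensitive, since the nltk stopwords are lowercase)
anime_df['animeDescription']  = anime_df['animeDescription'].apply(lambda x: ' '.join([word for word in x.split() if word.lower() not in (stop)]))

# removing punctuation from the dataframe
# regex=True is required: recent pandas versions treat the pattern as a literal string by default
anime_df['animeDescription'] = anime_df['animeDescription'].str.replace(r'[^\w\s]', ' ', regex=True)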

# stemming the dataframe 
anime_df['animeDescription'] = anime_df['animeDescription'].apply(stem_sentences)

# remove [Written by MAL Rewrite]
# e.g. text.replace('[Written by MAL Rewrite]', '') 

In [None]:
words_list = ' '.join([i for i in anime_df['animeDescription']]).split()

## 2.1. Conjunctive query

Given a query (e.g. *saiyan race*) the Search Engine returns a list of documents. Since we are dealing with conjunctive queries (AND), each of the returned documents contains all the words in the query. The final output of the query returns, if present, the following information for each of the selected documents:
- animeTitle
- animeDescription
- Url
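The idea behind a conjunctive query can be sketched on a toy corpus: build an inverted index that maps each word to the set of documents containing it, then intersect the sets of the query words. The documents and function names below are made up for illustration, and the sketch keys the index by word rather than by the integer ids used in the cells that follow.

```python
from collections import defaultdict

# Toy corpus of already preprocessed descriptions (made-up data).
documents = {
    "anime_1": "saiyan race fight earth",
    "anime_2": "car race city",
    "anime_3": "saiyan prince proud race",
}


def build_inverted_index(docs):
    """Map every word to the set of document ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            index[word].add(doc_id)
    return index


def conjunctive_query(index, query):
    """Return the sorted ids of the documents that contain ALL the query words."""
    words = query.split()
    if not words:
        return []
    # a word that is missing from the index yields an empty set, hence no results
    postings = [index.get(word, set()) for word in words]
    return sorted(set.intersection(*postings))


index = build_inverted_index(documents)
print(conjunctive_query(index, "saiyan race"))    # ['anime_1', 'anime_3']
print(conjunctive_query(index, "saiyan dragon"))  # []
```

Building the index in a single pass over the documents keeps the cost proportional to the total number of words, instead of scanning every document once per vocabulary entry.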


In [None]:
# remove duplicates
words_dict = set(words_list)

# assign a unique integer id to each unique word
vocabulary = {}
i=1
for word in words_dict:
    vocabulary.update({i:word})
    i+=1

### 2.1.1 Create your index!

In [None]:
# EXECUTE ONLY ONCE
# IF THE vocabulary.json FILE EXISTS DO NOT EXECUTE THIS CELL

# create the vocabulary.json file to store each unique word 
# and its corresponding id number 
with open("vocabulary.json", "w") as file:
    json.dump(vocabulary, file)

In [None]:
# EXECUTE ONLY ONCE
# IF THE inverted.json FILE ALREADY EXISTS DO NOT EXECUTE THIS CELL

# REMARK: the execution can take some time (>1 hour)

# create inverted index
inverted_dict = {}
with open('vocabulary.json') as data_file:    
    data = json.load(data_file)
    for key, value in tqdm(data.items()):
        inverted_list = []
        for i in range(0,len(anime_df)):
            if(value in anime_df['animeDescription'][i].split()):
                anime_name = 'anime_'+str(i+1)
                inverted_list.append(anime_name)
                inverted_dict.update({key:inverted_list})

In [None]:
# EXECUTE IF AND ONLY IF THE INVERTED INDEX HAS BEEN BUILT IN THE PREVIOUS CELL

# save the inverted index in a .json file
with open("inverted.json", "w") as file:
    json.dump(inverted_dict, file)

### 2.1.2 Execute the query

In [None]:
def find_query(query_list):
    """
    Input: the user's query, as a list of words
    Output: a list of animes that match the query
    """
    # load the vocabulary and the inverted index only once
    with open('vocabulary.json') as data_file:
        data = json.load(data_file)
    with open('inverted.json') as inverted_file:
        inverted_data = json.load(inverted_file)

    anime_query_list = []
    for word in query_list:
        for key, value in data.items():
            if(word == value and key in inverted_data):
                # appending the list of animes that contain this query word
                anime_query_list.append(inverted_data[key])
    
    # creating a list from all animes including duplicate ones
    anime_list = []
    for i in range(len(anime_query_list)):
        for j in range(len(anime_query_list[i])):
            anime_list.append(anime_query_list[i][j])
    
    # creating a set to find non duplicate anime files
    anime_query_set_list = list(set(anime_list))
    # creating an empty list to store the final anime list which has all the input queries 
    anime_final_list = []
    
    # counting the occurrences of each anime and comparing with the number of query words:
    # if they are equal, then each word in the query appears in the anime description
    # (comparing with len(query_list) also rejects queries with a word that is not in the vocabulary)
    for anime in anime_query_set_list:
        if(anime_list.count(anime) == len(query_list)):
            anime_final_list.append(anime)
    return anime_final_list

In [None]:
def create_query_anime_df(anime_list):
    """
    Input: a list of animes (obtained through a user's query)
    Output: a dataframe with the title, the description and the url of all the
            animes in the list
    """
    # creating lists for animes
    animeTitle_list = []
    animeDescription_list = []
    animeUrl_list = []

    # assigning tsv values from the animes to lists we've just created 
    for anime in anime_list:
        anime_tsv = open('tsv_files/'+anime+'.tsv', 'r',encoding="utf8")
        data=pd.read_table(anime_tsv)[['animeTitle','animeDescription']]
        data['animeTitle'] = data['animeTitle'].astype(str)
        data['animeDescription'] = data['animeDescription'].astype(str)
        animeTitle_list.append(str(data.animeTitle[0]))
        animeDescription_list.append(str(data.animeDescription[0]))

    # reading text file url lines to a list
    f=open('links.txt')
    url_lines=f.readlines()
    f.close()

    # creating a for loop to iterate over each anime we have on the anime_list
    for anime in anime_list:
        # getting the int value from the anime name
        anime_num=(int(anime.split("anime_",1)[1]))
        # finding the corresponding line from the links.txt (without the trailing newline) and assigning it to a list
        animeUrl_list.append(url_lines[(anime_num-1)].strip())

    # creating the dataframe from lists and returning it
    return pd.DataFrame(np.column_stack([animeTitle_list, animeDescription_list, animeUrl_list]), 
                                   columns=['animeTitle', 'animeDescription', 'Url'])

In [None]:
# getting an input from the user
# example
# query = 'saiyan race'
query = input('Enter your search:')

In [None]:
# creating a list from the input query
# the query gets the same stemming as the descriptions stored in the index
query_list = stem_sentences(query).split()
# getting the list of animes which has the query
anime_list = find_query(query_list)

In [None]:
# getting the anime dataframe from our query
query_anime_df = create_query_anime_df(anime_list)

In [None]:
# print the results
query_anime_df

## 2.2

# 5. Algorithmic question

**Disclaimer**: we took and adapted some of the following coding ideas from https://www.geeksforgeeks.org/k-maximum-sums-non-overlapping-contiguous-sub-arrays/ and also from the discussions on
https://www.hackerrank.com/challenges/maximum-subarray-sum/problem.


We are asked to manage a back-to-back sequence of requests for appointments. A sequence of requests is of the form `[30, 40, 25, 50, 30, 20]`, where each number is the time that the person who makes the appointment wants to spend. We can accept some requests, but there must be a break between accepted appointments: two consecutive requests cannot both be accepted.

For example, `[30, 50, 20]` is an acceptable solution (of duration 100), but `[30, 40, 50, 20]` is not, because 30 and 40 are two consecutive appointments. 

**Goal**: provide a schedule that maximizes the total length of the accepted appointments. Provide also:
- an algorithm that computes the acceptable solution with the longest possible duration;
- a program that, given as input an instance in the form above, returns the optimal solution.

For example, in the previous instance, the optimal solution is `[40, 50, 20]`, of total duration 110.

## Formalization of the problem

Given an array of positive integers, find the maximum sum over all the subsequences in which no two numbers are adjacent in the array, and return both the maximum sum and the subsequence(s) that realize it. If $f=f(v)$ is the function we want to implement and $v=(30, 40, 25, 50, 30, 20)$, then we should have $f(v)=(40, 50, 20)$ with sum $s=110$, as in the example above.

**Algorithmic idea: Dynamic programming**. Given an array $v$ of dimension $n$, let $v^*[i]$ be the value of the optimal solution using the elements with indices $0,..,i$. As base cases set $v^*[0] = v[0]$ and $v^*[1] = \max(v[0],v[1])$; then $v^*[i] = \max(v^*[i - 1], v^*[i - 2] + v[i])$ for $i = 2, ..., n-1$. Clearly $v^*[n-1]$ is the solution we want and it is obtained in $O(n)$ time. We can then use another array to store which choice is made for each subproblem, and so recover the actual elements chosen.
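The recurrence can be sketched directly as a bottom-up table followed by a backward walk that recovers the chosen elements. This is a minimal illustration assuming positive integers, as in the problem statement; the function name `max_non_adjacent` is hypothetical and the sketch is separate from the implementation in the Code section.

```python
def max_non_adjacent(v):
    """Return (chosen elements, maximum sum) with no two chosen elements adjacent.

    Assumes that v contains positive integers only.
    """
    n = len(v)
    if n == 0:
        return [], 0

    # best[i] = maximum sum using only the elements with indices 0..i
    best = [0] * n
    best[0] = v[0]
    if n > 1:
        best[1] = max(v[0], v[1])
    for i in range(2, n):
        best[i] = max(best[i - 1], best[i - 2] + v[i])

    # walk backwards to recover which elements were chosen
    chosen = []
    i = n - 1
    while i >= 0:
        if i == 0:
            chosen.append(v[0])
            break
        if best[i] == best[i - 1]:
            i -= 1                # v[i] is not needed for the optimum
        else:
            chosen.append(v[i])   # v[i] is taken, so v[i-1] must be skipped
            i -= 2
    chosen.reverse()
    return chosen, best[-1]


print(max_non_adjacent([30, 40, 25, 50, 30, 20]))  # ([40, 50, 20], 110)
print(max_non_adjacent([1, 2, 2, 10, 1]))          # ([2, 10], 12)
```

Storing the whole table `best` costs $O(n)$ memory, but it makes the reconstruction a simple comparison of neighbouring entries.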

The same idea can be used to solve a more general problem as shown in the examples at the end of this paragraph.

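*Example*. Let $v=(1,2,2,10,1)$ and consider the matrix

$$\begin{pmatrix} 1 & 0+2=2 & 1+2=3 & 2+10=12 & 3+1=4 \\ 0 & \max(0,1)=1 & \max(1,2)=2 & \max(2,3)=3 & \max(3,12)=12 \end{pmatrix}$$

whose first row stores, for each position, the best sum that includes the current element, and whose second row stores the best sum that excludes it. Then the maximum sum of a subsequence with no adjacent elements is 12, and the elements that realize it are (2,10).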

## Code

In [1]:
# allows to initialize dictionaries with a lambda function 
# and provides the default value for a nonexistent key.
# so a defaultdict will never raise a KeyError.
from collections import defaultdict

In [2]:
def solution(array):
    # to track sums
    sums = [0]*len(array)
    
    # to track elements of the input array
    # example: if array = [1,2,3,5,4] at the end of the following for loop
    # elements = {(0, 1): 1, (0, 2): 2, (1, 3): 4, (2, 5): 7, (4, 4): 8}
    elements = defaultdict(lambda: -1)
    
    for i in range(len(array)):
        # calculate maximum sum 
        sums[i] = max(sums[i-1], sums[i-2] + array[i])
        # memorize
        if max(sums[i-1], sums[i-2] + array[i])- (sums[i-2] + array[i]) == 0:
            elements[sums[i-2], array[i]] = sums[i]
    
    # retrieve elements that produce the optimal solution
    optimal_subarray = []
    
    # initialization
    max_value = max(elements.values())
    count = list(elements.keys())[list(elements.values()).index(max_value)][0]
    
    
    # to retrieve the optimal subarray
    # each key of elements is a pair (best sum before the chosen element, chosen element)
    # and its value is the best sum after choosing that element.
    # example: if array = [1,2,3,5,4,4] the last entry is (7, 4): 11, i.e. the element 4
    # is added to the best previous sum 7 to reach the optimal sum 11.
    # we walk backwards from the maximum: the second index gives the element to keep, and
    # the first index gives the sum to look up next, until there are no more elements (i.e. count = 0)
    while count != 0:
        optimal_value = list(elements.keys())[list(elements.values()).index(max_value)][1]
        cum_sum = list(elements.keys())[list(elements.values()).index(max_value)][0]
        # put an element that realizes the optimal solution to the list
        optimal_subarray.insert(0,optimal_value)

        max_value = cum_sum
        count = cum_sum

    
    return optimal_subarray, sums[-1]

## Some examples

In [3]:
solution([1,2,2,10,1])

([2, 10], 12)

In [4]:
solution([1,2,3,5,4,4])

([2, 5, 4], 11)

In [5]:
solution([30, 40, 25, 50, 30, 20])

([40, 50, 20], 110)

## Solution of a generalization of the previous problem

**Attention:** the following code needs refinement. For example, it works poorly in some test cases (e.g. when the array contains duplicate elements or a lot of contiguous elements).

In [6]:
dd = defaultdict(lambda: -1)
prefix_sum = []
trace = []

In [7]:
def sub_array_sum(i, j):
    """
    Input: indexes i,j of an array v with i<=j
    Output: v[i]+v[i+1]+...+v[j-1]+v[j], computed from the global prefix_sum array
    Remark: if i=j it returns v[i]
    """
    if i == 0:
        return prefix_sum[j]
    return (prefix_sum[j] - prefix_sum[i - 1])

In [8]:
def maximum_sum(cur, v, k):
    """
    Input: current element cur, array v, positive integer k 
    Output: current maximum sum 
    Remark: this function also keeps track of the elements that realise the maximum sum.
    """
    if cur >= len(v):
        return 0
    if dd[cur] != -1:
        return dd[cur]
    
    # use the following line when all the elements in the array are positive, 
    # else set s1 and s2 to -Infinity
    s1 = -1; s2 = -1
    
    # choose subarray starting at the current element "cur"
    if cur + k - 1 < len(v):
        # Remark: for k=1, sub_array_sum(cur,cur)=v[cur]
        s1 = sub_array_sum(cur, cur + k - 1) + maximum_sum(cur + k + 1, v, k)
    
    # ignore subarray starting at "cur"
    s2 = maximum_sum(cur + 1, v, k)
    dd[cur] = max(s1, s2)
    
    if s1 >= s2:
        # keep track of the elements that realise the maximum sum
        trace[cur] = (True, cur + k + 1)
        return s1
    trace[cur] = (False, cur + 1)
    
    return s2

In [9]:
def sub_array(v, trace, k):
    """
    Input: array v, array trace, positive integer k 
    Output: optimal solution, i.e. optimal subarray
    Remark: this function returns non-consecutive subarrays of size k 
            for every positive integer k, but in our problem only the case 
            k=1 is of interest.
    """
    i = 0
    subArrays = []
    # follow the trace from the first element: each entry stores whether the
    # subarray starting at i is taken and the index of the next subproblem
    while i < len(trace):
        if trace[i][0]:
            subArrays.append(v[i : i + k])
        i = trace[i][1]

    return subArrays

In [10]:
def generalized_solution(v, k):
    """
    Input: array v, positive integer k 
    Output: None (it prints the array, the maximum sum and the optimal subarray(s))
    Remark: this function finds non-consecutive optimal subarray(s) of size k 
            for every positive integer k, but in our problem only the case 
            k=1 is of interest.
    """
    global dd, trace, prefix_sum
    dd = defaultdict(lambda: -1)
    
    # initialization
    trace = [(False, 0)] * len(v)
    prefix_sum = [0] * len(v)
    prefix_sum[0] = v[0]
    
    for i in range(1,len(v)):
        prefix_sum[i] += prefix_sum[i - 1] + v[i]
        
    print("Array :", v)
    print("Max sum: ", maximum_sum(0, v, k))
    print("Subarrays: ", sub_array(v, trace, k))

## Some examples of solution of a more general problem

To solve a generalized version of the problem take $k>1$, as shown below.

In [11]:
generalized_solution([1,2,3,4,5], 1)

Array : [1, 2, 3, 4, 5]
Max sum:  9
Subarrays:  [[1], [3], [5]]


In [12]:
generalized_solution([1,2,3,4,5], 2)

Array : [1, 2, 3, 4, 5]
Max sum:  12
Subarrays:  [[1, 2], [4, 5]]


In [13]:
generalized_solution([1,2,3,4,5], 3)

Array : [1, 2, 3, 4, 5]
Max sum:  12
Subarrays:  [[3, 4, 5]]
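
For comparison, the same generalized problem (choose blocks of exactly $k$ consecutive elements, with at least one element left out between two chosen blocks) can also be sketched bottom-up, without global state or recursion. This is an illustrative alternative assuming non-negative values; the function name `max_k_blocks` is hypothetical.

```python
def max_k_blocks(v, k):
    """Return (chosen blocks, maximum sum) for blocks of k consecutive elements
    separated by at least one skipped element. Assumes non-negative values."""
    n = len(v)

    # prefix[i] = v[0] + ... + v[i-1], so a block sum is a difference of two entries
    prefix = [0] * (n + 1)
    for i, value in enumerate(v):
        prefix[i + 1] = prefix[i] + value

    # best[i] = maximum sum obtainable from the elements with indices i..n-1
    best = [0] * (n + 2)      # two extra zero entries cover indices n and n+1
    take = [False] * n        # take[i] is True if a block starts at index i
    for i in range(n - 1, -1, -1):
        best[i] = best[i + 1]                      # option 1: skip v[i]
        if i + k <= n:
            # option 2: take the block v[i:i+k] and skip the element after it
            pick = prefix[i + k] - prefix[i] + best[min(i + k + 1, n + 1)]
            if pick >= best[i]:
                best[i] = pick
                take[i] = True

    # follow the decisions from the start to list the chosen blocks
    blocks = []
    i = 0
    while i < n:
        if take[i]:
            blocks.append(v[i:i + k])
            i += k + 1
        else:
            i += 1
    return blocks, best[0]


print(max_k_blocks([1, 2, 3, 4, 5], 1))  # ([[1], [3], [5]], 9)
print(max_k_blocks([1, 2, 3, 4, 5], 2))  # ([[1, 2], [4, 5]], 12)
print(max_k_blocks([1, 2, 3, 4, 5], 3))  # ([[3, 4, 5]], 12)
```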


## Alternative solution

With immense surprise we found that it is possible to solve the problem with just 3 lines of code! See https://codegolf.stackexchange.com/questions/183390/maximum-summed-subsequences-with-non-adjacent-items?answertab=active#tab-top for more details.

Here is the solution.

In [14]:
v = [30, 40, 25, 50, 30, 20]
k = 1

In [15]:
f=lambda a:a and max([a[:1],a[:1]+f(a[2:]),f(a[1:])],key=sum)or a
for a, s in [(v, k)]:
    print(f(a), sum(f(a)))

[40, 50, 20] 110


In [16]:
v = [1, 2, 3, 5, 4]
k = 1

In [18]:
f=lambda a:a and max([a[:1],a[:1]+f(a[2:]),f(a[1:])],key=sum)or a
for a, s in [(v, k)]:
    print(f(a), sum(f(a)))

[1, 3, 4] 8


**Credits**: Chas Brown https://codegolf.stackexchange.com/users/69880/chas-brown